Abstract

Information, stored or transmitted in digital form, is often structured. Individual data 
records are usually represented as hierarchies of their elements. Together, records form larger 
structures. Information processing applications have to take account of this structuring, which 
assigns different semantics to different data elements or records. The wide variety of structural
schemata in use today often demands considerable flexibility from applications, for example to
process information coming from different sources. To ensure application interoperability,
translators are needed that can convert one structure into another. 

This paper puts forward a formal data model aimed at supporting hierarchical data pro- 
cessing in a simple and flexible way. The model is based on and extends results of two classical 
theories, studying finite string and tree automata. The concept of finite automata and regular 
languages is applied to the case of arbitrarily structured tree-like hierarchical data records, 
represented as "structured strings." These automata are compared with classical string and 
tree automata; the model is shown to be a superset of the classical models. Regular grammars 
and expressions over structured strings are introduced. 

Regular expression matching and substitution has been widely used for efficient unstruc- 
tured text processing; the model described here brings the power of this proven technique to 
applications that deal with information trees. A simple generic alternative is offered to replace 
today's specialised ad-hoc approaches. The model unifies structural and content transforma- 
tions, providing applications with a single data type. An example scenario of how to build 
applications based on this theory is discussed. Further research directions are outlined. 

Categories and subject descriptors: E.1 [Data Structures]: Trees; F.4.3 [Mathematical
Logic and Formal Languages]: Formal Languages - Classes defined by grammars or automata;
I.7.2 [Document and Text Processing]: Document Preparation - Markup languages;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval -
Query formulation

General Terms: Theory, Languages 

Additional Key Words and Phrases: data model, hierarchy, information structure, regular 
expression, tree automaton 



1 Introduction 

Information processing has always faced the need to take into account the structure of the data 
being processed. Structuring of information plays an important role in fostering automated, com- 
puterised data capture, storage, search, retrieval, and modification. For example, an unstructured 
bibliographic reference like 'Bourbaki, N. Lie Groups and Lie Algebras' requires either human
assistance or the use of heuristics to determine whether Lie is the name of the author or a part of
the title. On the other hand, dividing that reference into two parts, 'author' and 'title', from the
very beginning would have solved this problem. The 'author' part can be further sub-structured
into 'last name', 'first initial' and so on. 

The structure of data records depends on the kind of information the records carry: for example,
a flight schedule, asset list, book, e-mail message, and so on. It is often necessary to convert
data between different structured representations, along with more basic tasks such as retrieval of 
certain components of a record or addition of a new component. 

The hierarchical (tree-like) way of organising information is very popular and convenient, because
it allows aggregation of details at different granularity levels. Most of the data structures used 
today either are already hierarchical, or can be expressed using an information tree [1]. XML [2] is 
being increasingly used as the standard language for representing hierarchically structured data. 
Areas where structured information is actively utilised include: 

• text processing (markup languages); 

• information retrieval (document processing, query adaptation); 

• compilers (syntax trees);

• library automation (bibliographic records); 

• a wide variety of industrial applications. 

This paper puts forward a simple and flexible formal data model for manipulating structured 
information. This model is built on top of a combination of results from finite automata and 
tree automata theories. The concept of automata and regular languages is applied to the case of 
arbitrarily structured tree-like hierarchical data records, represented as "structured strings." The 
paper compares these automata with classical string and tree automata, showing that the theory 
presented here is a superset of the classical models: everything that can be done with finite string 
or tree automata can also be done in our model. 

The data manipulation model suggested here is based around tree regular expressions, their 
matching and substitution. Regular expressions are widely used in non-structured text process- 
ing, serving as a core model in many text processing applications [3]. They are good for select- 
ing fragments that match certain patterns in certain contexts. For example, regular expression 
'Figure +([0-9]+)' selects all figure numbers in a text document (more precisely, it selects all 
decimal numbers that follow the word 'Figure', separated by at least one space). 
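
The behaviour of this pattern can be checked with a few lines of Python. This is only an illustrative sketch using the standard `re` module; the function name `figure_numbers` and the sample sentence are assumptions introduced here, not part of the model.

```python
import re

# The pattern from the text: the word "Figure", at least one space, then a decimal number.
FIGURE_NUMBER = re.compile(r"Figure +([0-9]+)")

def figure_numbers(text: str) -> list[str]:
    """Return every decimal number that follows 'Figure' and one or more spaces."""
    return FIGURE_NUMBER.findall(text)

# Hypothetical sample text, chosen to show both matches and non-matches.
sample = "See Figure 1 and Figure  12; Figures 3 is not matched, nor is figure 4."
print(figure_numbers(sample))  # ['1', '12']
```

Only the parenthesised group is returned, which is what makes the expression useful for selecting a fragment in a given context rather than the whole match.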

This paper brings the power of regular expressions to serve hierarchical data manipulation. 
The notion of regular expressions is extended to information trees in a way that matches most 
applications' needs. This offers a single, simple generic solution to many problems where different 
incompatible ad-hoc approaches have been used before. Although the model is described from 
a theoretical standpoint, its practical applications are considered, drawing on the experience of 
using regular expression matching techniques on plain text documents. There is room for further 
research in terms of both theory and applications. 

The rest of the paper is organised as follows. The next section surveys related works and 
discusses approaches taken previously. Section 3 introduces the data model proposed for rep- 
resentation of information trees, and Section 4 provides its formal definition. Then, Section 5 
defines finite tree automata that operate on information trees, and discusses properties of such 
automata. Section 6 introduces regular grammars and expressions and shows their equivalence 
to the finite tree automata. Section 7 presents an example application scenario to illustrate how 
the described theory can be used in practice. Section 8 concludes the paper and outlines further 
research directions. 
2 Related works



2.1 Murata's forest algebra and tree automata theory 

The most recent research on formal tree-structured data models belongs to Makoto Murata, 
who applied and extended the theory of tree automata [4] to the problem of transformation of 
SGML/XML documents [5, 6, 7]. In [8] Murata offers a hierarchical data model based on tree 
automata theory and the work of Podelski on pointed trees [9]. However, Murata uses tree au- 
tomata for purely representational purposes, as a means to formally define XML schema in his 
model, rather than as a main processing tool. 

The theory of tree automata [4] studies classes of trees, called languages, in a way similar to 
the formal language theory. A tree language is defined by its syntax, which can, by analogy with 
the language theory, be described by either a tree grammar or a tree automaton. Murata shows 
the close relation between SGML or XML document syntax (usually referred to as Document Type 
Definition, or DTD) and the syntax of its tree representation. Because the classical theory only 
deals with tree languages with limited branching (the maximum number of branches that any 
node of any tree of a language may have is a constant determined by the syntax of that language), 
Murata had to extend the theory to handle unlimited branching, necessary to represent marked-up 
documents. 

In [6], Murata suggests a data model for transformations of hierarchical documents. Although 
primarily intended to serve the SGML/XML community, the model is a generally applicable forest- 
based model. Murata's extended tree automata are used for schema representation, parallel to 
DTDs in SGML and XML. The core of the model is a forest algebra, containing fourteen operators 
for selecting and manipulating document forests. This algebra can select document fragments 
based on patterns (conditions on descendent nodes) and contextual conditions (conditions on non- 
descendent nodes, such as ancestors, siblings, etc.). One of the strong points of Murata's data 
model is that during transformation of documents, syntactical changes are tracked in parallel: 
operators apply not only to the document being processed, but also to its schema, so that at any 
stage in a transformation process, the schemata of the intermediate results are known. 

2.2 XPath 

A different approach to transformations of hierarchically organised documents is taken by the 
World Wide Web Consortium in their XPath [10] and XSLT [11] specifications. Although designed 
specifically for the XML document representation format, these recommendations are applicable 
to a wide variety of other types of documents that can be reasonably translated into XML. The 
XML Path Language (XPath) provides a common syntax and semantics for addressing parts 
of an XML document — functionality used by other specifications, such as XSL Transformations 
(XSLT). XPath also has facilities for manipulating strings, numbers, and Booleans, which support 
its primary purpose. 

XPath's model of an XML document is that of an ordered tree of nodes, where nodes can 
be of seven types. The multitude of types is needed to support various XML features, such as 
namespaces, attributes, and so on. XPath can operate on documents that come with or without 
a Document Type Definition (DTD). A DTD, when supplied, unlocks some functionality of the 
XPath processor, such as the ability to find unique IDs of document elements, or to use default 
attribute values. 

XPath is an expression-based language. Expressions evaluate to yield objects of four basic 
types: node-set, Boolean, number (floating-point), and string. Expressions consist of string and 
numeric constants, variable references, unary and binary operators, function calls, and special 
tokens. The specification defines a core function library that all XPath implementations must
support. The library contains 7 node-set functions, 10 string, 5 Boolean, and 5 numeric functions.
Some of the XPath operators and functions, such as +, -, floor, string-length, concat, are of 
general purpose nature and are typical of traditional programming languages. 
3 Hierarchical data model: An informal introduction

The basic data model that we use in this paper was originally introduced in [1]. It was also shown
there that all popular information structuring methods can be realised using tree-like structures
and expressed by this model. We give a brief description of it here.

Informally, in the proposed model a document is represented as a finite ordered labelled tree. 
Each node of the tree is associated with a label: a string over an alphabet (see Figure 1). In 
the traditional terminology, the labels of leaf, or terminal, nodes of a document are called data 
elements. They carry the "actual" content of the document. The labels of internal (non-leaf) 
nodes are referred to as tags, whose purpose is to describe those data elements. When speaking 
of tags and data elements, we shall often make no distinction between nodes and their labels. 




Figure 1: Labelled tree over alphabet $\Sigma$

Tags form the structure of a document, specifying the semantics of their underlying sub-trees 
(and, ultimately, the data elements). The sequence of tags from the root to a data element is 
called a tag path and fully identifies the properties and interpretation of that element. Sometimes 
tag paths are used as keys to extract data elements from documents. 
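
As a rough illustration of how tag paths can serve as keys, the sketch below models a document as a finite ordered labelled tree and pairs every data element with its tag path. The names `Node` and `tag_paths` are assumptions made for this example only; the sample record is the name/first/last tree used later in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of a finite ordered labelled tree; the label is a string."""
    label: str
    children: list["Node"] = field(default_factory=list)

def tag_paths(node: Node, prefix: tuple[str, ...] = ()) -> list[tuple[tuple[str, ...], str]]:
    """Pair every data element (leaf label) with the sequence of tags leading to it."""
    if not node.children:
        return [(prefix, node.label)]
    result = []
    for child in node.children:
        # Internal node: its label is a tag and becomes part of the path.
        result.extend(tag_paths(child, prefix + (node.label,)))
    return result

record = Node("name", [Node("first", [Node("Joe")]), Node("last", [Node("Bloggs")])])
for path, data in tag_paths(record):
    print("/".join(path), "=", data)
```

For the sample record this prints `name/first = Joe` and `name/last = Bloggs`.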

This model has the following properties: 

• Unlimited branching: although all trees are finite, the number of children a node may have 
can be arbitrarily large. By contrast, in the tree automata theory branching is limited and 
is determined by the tree's ranked alphabet. 

• Unlimited number of possible labels: labels are selected from an infinite set of strings over 
a finite alphabet. 

• Tags and data elements, which are traditionally regarded as belonging to different domains, 
are built here uniformly from the same alphabet. This allows the use of the same operators 
and mechanisms for both leaves and internal nodes. Information contained in data elements 
and tags is freely interchangeable. 

• This model can be used for string manipulations, whereby strings are represented as single- 
node trees. 

The last property suggests that traditional string operators could probably be extended to the 
tree case. In other words, there is an opportunity to design a good tree algebra in such a way that 
restricting it to single node trees would result in a meaningful and convenient string algebra. 

4 Model definition 

This section presents a formal definition of the model described above. We start it by introducing 
the terminology used in the rest of the paper. 
• $\Sigma$ is a finite set of symbols, called the alphabet. For notational convenience, we assume that $\Sigma$
does not contain angle brackets or the slash: $\langle, \rangle, / \notin \Sigma$.

• $\Sigma^*$ is the free monoid on $\Sigma$, or the set of all strings over $\Sigma$ together with the concatenation
operator.

• $\varepsilon$ is the empty string; $\varepsilon \in \Sigma^*$.

All the examples given below are based on the alphabet of Latin letters. 

4.1 String trees 

We introduce string trees over an alphabet $\Sigma$ as strings with angle brackets such that (a) brackets
match pairwise, and (b) each whole tree is enclosed in a pair of brackets, for example:
$\langle ab\langle cde\rangle f\langle g\langle hi\rangle\rangle\rangle$. Visually similar to the tree representations found in classical literature [12]
and in recent research [6], string trees bear one significant difference. Traditionally, each symbol
marked a separate node; brackets contained all the children of the node marked by the symbol
immediately preceding the opening bracket. In our model, each node is labelled by a sequence of
symbols, enclosed in a pair of brackets.

The set $T(\Sigma)$ of string trees over $\Sigma$ is defined as the minimum subset of $(\Sigma \cup \{\langle, \rangle\})^*$ such that:

1. $\langle\rangle \in T(\Sigma)$ (the null tree)

2. $\langle a\rangle \in T(\Sigma)$ for any $a \in \Sigma$

3. $\langle xy\rangle \in T(\Sigma)$ for any $\langle x\rangle, \langle y\rangle \in T(\Sigma)$ (concatenation)

4. $\langle t\rangle \in T(\Sigma)$ for any $t \in T(\Sigma)$ (encapsulation).

It follows immediately from this definition that $T(\Sigma)$ forms a monoid with respect to concatenation
(from 1 and 3) and that all strings from $\Sigma^*$, enclosed in a pair of angle brackets, are contained in
$T(\Sigma)$.
Because concatenation in $T(\Sigma)$ is defined differently from the usual string concatenation in
$(\Sigma \cup \{\langle, \rangle\})^*$, it will be denoted as a centered dot ($\cdot$). For example, if $u, v \in T(\Sigma)$, then $u \cdot v$ is
their tree concatenation and $uv$ is their string concatenation. We could, of course, discard the
outermost pair of angle brackets from trees in $T(\Sigma)$: they are present in all trees anyway. This
would eliminate the difference between concatenations of trees and strings, making $T(\Sigma)$ a
sub-monoid of $(\Sigma \cup \{\langle, \rangle\})^*$. However, this would also complicate automata and grammars on string
trees.
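
The definition can be made concrete with a small membership test and the tree concatenation operator. This is a sketch under stated assumptions: the ASCII characters `<` and `>` stand in for the angle brackets, symbols are single characters, and the names `is_string_tree`, `tree_concat` and `LATIN` are introduced here for illustration. The test relies on the observation that rules 1 to 4 generate exactly the balanced bracket strings that are enclosed in one outermost pair.

```python
import string

# Assumed alphabet for the examples: lower-case Latin letters.
LATIN = set(string.ascii_lowercase)

def is_string_tree(s: str, alphabet: set[str]) -> bool:
    """Check that s is a balanced bracket string wholly enclosed in one outer pair."""
    if len(s) < 2 or s[0] != "<" or s[-1] != ">":
        return False
    depth = 0
    for i, ch in enumerate(s):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            # The outermost pair may only close at the very last character.
            if depth == 0 and i != len(s) - 1:
                return False
        elif ch not in alphabet:
            return False
    return depth == 0

def tree_concat(u: str, v: str) -> str:
    """Tree concatenation: <x> . <y> = <xy> (drop the inner pair of brackets)."""
    return u[:-1] + v[1:]

print(is_string_tree("<ab<cde>f<g<hi>>>", LATIN))  # True
print(is_string_tree("<a><b>", LATIN))             # False: two trees, not one
print(tree_concat("<ab>", "<<c>>"))                # <ab<c>>
```

Note that `tree_concat("<a>", "<b>")` gives `<ab>`, whereas plain string concatenation gives `<a><b>`, which is not a tree. This is the difference between the two operators described above.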

We shall often encounter simple cases of trees, consisting of just one label, such as $\langle cetus\rangle$, and
their subset, single-symbol trees, such as $\langle a\rangle$. The following notation will be used:

• $\langle\Sigma\rangle$ is the set of all single-symbol trees from $T(\Sigma)$: $\{\langle a\rangle \mid a \in \Sigma\}$;

• $\langle\Sigma^*\rangle$ is, by analogy, the set of all single-string trees: $\{\langle s\rangle \mid s \in \Sigma^*\}$.

A matching pair of angle brackets can be thought of as a unary operator, which we called
encapsulation (denoted as $\langle\rangle$). Note that encapsulation and concatenation operators are free from
relations (apart from the associativity of concatenation), so they freely generate $T(\Sigma)$ from $\langle\Sigma\rangle$.
Thus, $T(\Sigma)$ can be called a free monoid with unary operator on $\langle\Sigma\rangle$.

Note that there are two different structures denoted by $T(\Sigma)$: strings over $\Sigma \cup \{\langle, \rangle\}$ and trees
over $\Sigma$. Strings possess just one binary operator, concatenation, which has no symbol. Trees have
two operators, concatenation and encapsulation, denoted by $\cdot$ and $\langle\rangle$. In the rest of the paper,
we shall be using this dual notation, where an element of $T(\Sigma)$ can be interpreted as a string or
a tree depending on the context. The context is uniquely identified by the operator signs used.
In the text, elements of $T(\Sigma)$ will always be called trees, to help distinguish them from arbitrary
strings from the larger set $(\Sigma \cup \{\langle, \rangle\})^*$.
4.2 Reduced string trees

The above definition of string trees is, however, too broad for the informal data model described 
in Section 3. Let us consider the correspondence between the two. 

A tree from $T(\Sigma)$ can be uniquely represented as $\langle s_0 t_1 s_1 t_2 s_2 \cdots t_n s_n\rangle$, where $n \geq 0$, $s_i \in \Sigma^*$,
and $t_i \in T(\Sigma)$. The root label of this tree is $s_0 s_1 \cdots s_n$, and the children of the root node are the
trees $t_1, t_2, \ldots, t_n$.

It is easy to see that the same tree in the informal model can correspond to different trees
in the formal model. For example, the following are two different versions of the tree depicted in
Figure 2: $\langle name\langle first\langle Joe\rangle\rangle\langle last\langle Bloggs\rangle\rangle\rangle$, $\langle na\langle fir\langle Joe\rangle st\rangle m\langle\langle Bloggs\rangle last\rangle e\rangle$.



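[Figure: a tree with root label "name" and two children labelled "first" and "last", whose only
children are the leaves "Joe" and "Bloggs" respectively]

Figure 2: A tree example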

However, a one-to-one correspondence can be achieved by the following commutativity relation 
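in $T(\Sigma)$:

$$\langle a\rangle \cdot \langle t\rangle = \langle t\rangle \cdot \langle a\rangle \quad \text{for any } a \in \Sigma,\ t \in T(\Sigma).$$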

This relation defines the monoid of reduced trees. This monoid is an exact match for our informal 
data model, because the ordering of label symbols in relation with sub-trees no longer matters: 
all symbols can be collected in one part of the label and all sub-trees can be gathered in the other 
part. 

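Reduced trees can be constructed as the image of the following map $r : T(\Sigma) \to T(\Sigma)$,
recursively defined as follows:

$$r(u_0 \cdot \langle t_1\rangle \cdot u_1 \cdot \langle t_2\rangle \cdot u_2 \cdot \ldots \cdot \langle t_n\rangle \cdot u_n) = u_0 \cdot u_1 \cdot \ldots \cdot u_n \cdot \langle r(t_1)\rangle \cdot \langle r(t_2)\rangle \cdot \ldots \cdot \langle r(t_n)\rangle,$$

where $n \geq 0$, $u_i \in \langle\Sigma^*\rangle$, and $t_i \in T(\Sigma)$. This can also be written in string notation:

$$r(\langle s_0 t_1 s_1 t_2 s_2 \cdots t_n s_n\rangle) = \langle s_0 s_1 \cdots s_n\, r(t_1)\, r(t_2) \cdots r(t_n)\rangle,$$

where $n \geq 0$, $s_i \in \Sigma^*$, and $t_i \in T(\Sigma)$. The binary operator (concatenation) in $\operatorname{Im} r$ is naturally
defined as

$$r(u) \cdot r(v) = r(u \cdot v).$$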

This operator is well-defined (that is, the result does not depend on the choice of $u$ and $v$). Indeed,
let us consider $u, u', v, v'$ such that $r(u) = r(u')$ and $r(v) = r(v')$. Then, applying the definition
of $r$ to $r(u \cdot v)$ and $r(u' \cdot v')$, we get $r(u \cdot v) = r(u) \cdot r(v) = r(u') \cdot r(v') = r(u' \cdot v')$. The associativity
of concatenation in $\operatorname{Im} r$ follows immediately from the associativity in $T(\Sigma)$. Thus, $r(T(\Sigma))$ is a
monoid and $r$ is a homomorphism from $T(\Sigma)$ onto $r(T(\Sigma))$. Note that despite $r(T(\Sigma))$ being a
subset of $T(\Sigma)$, it is not a sub-monoid.

Although reduced trees do provide a better match for actual real-life hierarchical data struc- 
tures, normal string trees can in fact be more useful because they give more flexibility in data 
manipulation. An actual transformation engine can easily convert from reduced trees to normal 
trees and back if it chooses to work with normal trees internally. 
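
The conversion from normal to reduced trees mentioned above can be sketched directly from the string form of the map r. The code below is an illustration only: it assumes ASCII `<` and `>` as brackets and well-formed input, and the helper names `split_top_level` and `reduce_tree` are hypothetical.

```python
def split_top_level(tree: str) -> tuple[str, list[str]]:
    """Split <s0 t1 s1 ... tn sn> into the root label s0 s1 ... sn and the children t1..tn."""
    body = tree[1:-1]          # drop the outermost pair of brackets
    label, children = [], []
    depth, start = 0, 0
    for i, ch in enumerate(body):
        if ch == "<":
            if depth == 0:
                start = i      # a child sub-tree begins here
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth == 0:
                children.append(body[start:i + 1])
        elif depth == 0:
            label.append(ch)   # a symbol of the root label
    return "".join(label), children

def reduce_tree(tree: str) -> str:
    """r(<s0 t1 s1 ... tn sn>) = <s0 s1 ... sn r(t1) ... r(tn)>."""
    label, children = split_top_level(tree)
    return "<" + label + "".join(reduce_tree(child) for child in children) + ">"

print(reduce_tree("<na<fir<Joe>st>m<<Bloggs>last>e>"))  # <name<first<Joe>><last<Bloggs>>>
```

Applying `reduce_tree` to an already reduced tree returns it unchanged, so both versions of the tree in Figure 2 map to the same reduced form.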
5 Finite automata



The notion of finite automata comes from different branches of computer science. A finite au- 
tomaton is a machine that has a finite set of states, can accept input from a finite set of input 
symbols, and changes its state when input is applied. The new state depends on the current state 
and the input symbol. 

In formal language theory, finite automata are used to describe sets of strings called regular
languages. A string is accepted by an automaton if, having consumed all the symbols of the string
one by one, the automaton ends up in a predefined final state. The set of all strings accepted by an
automaton is called a regular set (or regular language). The automaton is said to recognise this
language.

Similarly, in the tree automata theory, tree-like structures are operated on by automata which 
take symbols in tree nodes as inputs. There are two classes of tree automata. A "bottom-up" 
automaton starts at the leaves and moves upwards, while a "top-down" automaton descends from 
the root of the tree. Languages recognised by tree automata constitute the class of regular tree 
languages. 

Let us now introduce finite automata that operate on string trees in the bottom-up manner. 
The following definition essentially presents a mixture of the corresponding notions from the formal 
language theory and tree automata theory. 

Definition 1. A Non-deterministic Finite String Tree Automaton (NFSTA) over $\Sigma$ is a tuple
$A = (\Sigma, Q, Q_f, q_0, \Delta)$, where

• $Q \supseteq \Sigma$ is a finite set, called set of states, or state set;

• $Q_f \subseteq Q$ is a set of final states;

• $q_0 \in Q$ is the initial state;

• $\Delta$ is a set of transition rules of the form $(q_1, q_2) \to q_3$, where $q_1, q_2, q_3 \in Q$.

$\Delta$ can also be thought of as a subset of $Q^3$; however, interpreting it as a set of rules of the above
form is more intuitive.

Definition 2. A Deterministic FSTA (DFSTA) is an NFSTA whose $\Delta$ contains at most one rule
for each left-hand side $(q_1, q_2)$. In a DFSTA, $\Delta$ can also be considered as a partial function
$\Delta : Q \times Q \to Q$. A DFSTA whose $\Delta$-function is defined on all of $Q \times Q$ is called complete.
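
Definitions 1 and 2 translate naturally into a small data structure. The following is a minimal sketch, not a reference implementation: the class name `NFSTA`, its field names and the three predicate methods are assumptions made for illustration, and the sample instance uses the rule set of Example 1.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class NFSTA:
    alphabet: frozenset   # Sigma
    states: frozenset     # Q, required to contain Sigma
    final: frozenset      # Q_f
    initial: str          # q_0
    rules: frozenset      # Delta as triples (q1, q2, q3), i.e. a subset of Q^3

    def is_well_formed(self) -> bool:
        """Check the side conditions of Definition 1."""
        return (self.alphabet <= self.states
                and self.final <= self.states
                and self.initial in self.states
                and all(q in self.states for rule in self.rules for q in rule))

    def is_deterministic(self) -> bool:
        """At most one rule per left-hand side (q1, q2)."""
        left_sides = [(q1, q2) for q1, q2, _ in self.rules]
        return len(left_sides) == len(set(left_sides))

    def is_complete(self) -> bool:
        """Deterministic, and Delta is defined on all of Q x Q."""
        left_sides = {(q1, q2) for q1, q2, _ in self.rules}
        return self.is_deterministic() and left_sides == set(product(self.states, repeat=2))

example = NFSTA(
    alphabet=frozenset("ab"),
    states=frozenset({"a", "b", "q0"}),
    final=frozenset({"a", "b", "q0"}),
    initial="q0",
    rules=frozenset({("q0", "a", "a"), ("a", "a", "a"), ("q0", "b", "b"), ("b", "b", "b")}),
)
print(example.is_well_formed(), example.is_deterministic(), example.is_complete())  # True True False
```

The sample automaton is deterministic but not complete, because only four of the nine possible left-hand sides have a rule.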

The operation of a deterministic FSTA can be illustrated on our informal data model, introduced
in Section 3 above, as follows. The automaton starts at the leaves. Each leaf is processed
the way a traditional finite automaton processes a string. The automaton starts in the state $q_0$ and
takes the first symbol from the leaf's label. Because that symbol belongs to $\Sigma$, it also belongs to $Q$.
The automaton finds the rule in $\Delta$ whose left-hand side matches the state and the input symbol.
The right-hand side of that rule becomes the new state. The next input symbol is then taken from
the label and the process continues. If at some stage no rule can be applied, the run is considered
unsuccessful: the tree is not accepted.

When a leaf's label is successfully processed, that leaf is cut off. The resulting state of the 
automaton is inserted into the leaf's parent node label (the exact place in the label will be discussed 
later). As soon as all children of a node have been processed, the node itself becomes a leaf, and 
the automaton runs again. Note that the definition above allows an automaton to accept its own 
state as an input. If, after processing the root label, the automaton finishes in one of the "final
states" ($Q_f$), its run is considered successful.

A non-deterministic FSTA is different from a DFSTA in that it can switch to different states 
given the same current state and input symbol. An NFSTA can, therefore, have different runs on 
the same tree. If at least one of the possible runs is successful, the tree is accepted. 

Similarly to the string language and classical tree cases, deterministic and non-deterministic 
automata on string trees are equivalent: a language that is recognised by an NFSTA is recognised 
by some DFSTA and vice versa. The proof of this and some other basic facts about automata will
be given in Section 5.4 after a more formal definition of an FSTA run and recognisable string tree 
languages is presented. 

In order to formally define the run of a string tree automaton, we need to find a way to associate
current states with all the labels. To do this, we shall put the state in front of each label, separated
by a special symbol, denoted as a slash (/). For example, $\langle abc\rangle$ will be transformed to $\langle q_0/abc\rangle$,
after which the automaton will consume $abc$ step by step, changing the state symbol before the
slash. Thus, intermediate trees will all belong to $T(Q \cup \{/\})$. Note that $T(\Sigma) \subseteq T(Q \cup \{/\})$,
because $\Sigma \subseteq Q$.

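Definition 3. Let $A = (\Sigma, Q, Q_f, q_0, \Delta)$ be an FSTA; assume for convenience that $Q$ does not
contain the slash: $/ \notin Q$. Let also $a, b, c \in Q$; $r, l \in (Q \cup \{\langle, \rangle, /\})^*$; and $s \in Q^*$. The move relation
$\vdash_A$ between two trees from $T(Q \cup \{/\})$ is defined as follows:

$$
\begin{array}{llll}
\text{(a)} & l\langle s\rangle r \;\vdash_A\; l\langle q_0/s\rangle r & & \text{(initial state assignment)} \\
\text{(b)} & l\langle a/bs\rangle r \;\vdash_A\; l\langle c/s\rangle r & \text{if } (a, b) \to c \in \Delta & \text{(horizontal move)} \\
\text{(c)} & l\langle a/\rangle r \;\vdash_A\; l\,a\,r & & \text{(vertical move)}
\end{array}
$$

$\vdash_A^*$ is the reflexive and transitive closure of $\vdash_A$. A tree $t \in T(\Sigma)$ is accepted by the automaton
$A$ if there exists $q_f \in Q_f$ such that $t \vdash_A^* \langle q_f/\rangle$. An empty tree is therefore accepted if and only if
$q_0 \in Q_f$.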

In the definition of the move relation, a tree from $T(Q \cup \{/\})$ contains both the automaton's
input and its state. The role of the slash is to mark those labels to which step (a) has already been
applied.

As described above, an FSTA can make three different kinds of steps. Step (a), initial state
assignment, applies once to each leaf label (because $s$ cannot contain angle brackets or slashes).
Then step (b), the horizontal move, transforms a label according to the rules from $\Delta$ by consuming
one symbol and changing the state. Finally, labels which have been fully processed by step
(b) are cut off and their final states are inserted into their parent labels in step (c), the vertical
move. Eventually, intermediate nodes lose their descendants and become leaves, making themselves
available for step (a) and so on. The process stops at the root (or when there is no suitable rule
in $\Delta$).
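
The three kinds of steps can be simulated as rewriting on a list of tokens. The sketch below is one possible reading of Definition 3 for a deterministic automaton, under stated assumptions: ASCII `<`, `>` and `/` are the special symbols, input symbols are single characters, the transition table `DELTA` (the rules of Example 1) is stored as a Python dictionary, and the function names `step` and `accepts` are hypothetical. It always works on the leftmost leaf label that admits a move.

```python
def step(tokens, q0, delta):
    """Apply one move to the leftmost leaf label that admits one.
    Return the new token list, or None when no move applies anywhere."""
    for i, tok in enumerate(tokens):
        if tok != "<":
            continue
        j = i + 1
        while tokens[j] not in ("<", ">"):
            j += 1
        if tokens[j] == "<":
            continue                          # this label still has a child: not a leaf
        label = tokens[i + 1:j]
        if "/" not in label:                  # (a) initial state assignment
            return tokens[:i + 1] + [q0, "/"] + tokens[i + 1:]
        if len(label) > 2:                    # (b) horizontal move on  a / b s
            target = delta.get((label[0], label[2]))
            if target is not None:
                return tokens[:i + 1] + [target, "/"] + tokens[i + 4:]
        elif i > 0:                           # (c) vertical move, never applied to the root
            return tokens[:i] + [label[0]] + tokens[j + 1:]
    return None

def accepts(tree, q0, delta, final):
    """Run the automaton until no move applies; accept if the result is <q_f/>."""
    tokens = list(tree)
    while (nxt := step(tokens, q0, delta)) is not None:
        tokens = nxt
    return len(tokens) == 4 and tokens[1] in final and tokens[2:] == ["/", ">"]

# Rules of Example 1, written as a dictionary (a partial function Q x Q -> Q).
DELTA = {("q0", "a"): "a", ("a", "a"): "a", ("q0", "b"): "b", ("b", "b"): "b"}
FINAL = {"a", "b", "q0"}
print(accepts("<<<a>>a<aa>>", "q0", DELTA, FINAL))  # True
print(accepts("<a<b>>", "q0", DELTA, FINAL))        # False
```

Choosing the leftmost leaf is an arbitrary scheduling decision made for this sketch; the definition itself allows a step on any label to which a rule applies.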

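Example 1. Let $\Sigma = \{a, b\}$. The following automaton $A = (\Sigma, Q, Q_f, q_0, \Delta)$ accepts only trees
whose labels (all of them) are composed of the same letter, either $a$ or $b$.

$$
Q = \{a, b, q_0\}, \qquad Q_f = Q, \qquad
\Delta = \left\{ \begin{array}{l}
(q_0, a) \to a, \\
(a, a) \to a, \\
(q_0, b) \to b, \\
(b, b) \to b
\end{array} \right\}
$$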

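Consider the tree $\langle\langle\langle a\rangle\rangle a\langle aa\rangle\rangle$. This tree is accepted by $A$, as illustrated by one of its
possible runs, written here in string notation:

$$
\begin{array}{l}
\langle\langle\langle a\rangle\rangle a\langle aa\rangle\rangle
\vdash_A \langle\langle\langle q_0/a\rangle\rangle a\langle aa\rangle\rangle
\vdash_A \langle\langle\langle a/\rangle\rangle a\langle aa\rangle\rangle
\vdash_A \langle\langle a\rangle a\langle aa\rangle\rangle \\
\quad \vdash_A \langle\langle q_0/a\rangle a\langle aa\rangle\rangle
\vdash_A \langle\langle a/\rangle a\langle aa\rangle\rangle
\vdash_A \langle aa\langle aa\rangle\rangle \\
\quad \vdash_A \langle aa\langle q_0/aa\rangle\rangle
\vdash_A \langle aa\langle a/a\rangle\rangle
\vdash_A \langle aa\langle a/\rangle\rangle
\vdash_A \langle aaa\rangle \\
\quad \vdash_A \langle q_0/aaa\rangle
\vdash_A \langle a/aa\rangle
\vdash_A \langle a/a\rangle
\vdash_A \langle a/\rangle
\end{array}
$$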
It is intuitively clear that this automaton does not accept trees containing mixed
symbols: because of the way symbols are propagated, a mixed tree would eventually "resolve"
to the point where a leaf would contain different symbols. A move of the automaton on that
leaf would require a rule with $(a, b)$ or $(b, a)$ in its left-hand side, but $\Delta$ contains no such rule.
A complete proof of this statement is not significant for the further discussion and is therefore
omitted.
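
For a deterministic automaton, the outcome of a run can also be computed by a direct bottom-up evaluation instead of step-by-step rewriting. The sketch below does this for the automaton of Example 1. It assumes well-formed input with ASCII `<` and `>` as brackets; the name `evaluate` and the dictionary form of the rules are illustrative choices, not part of the model.

```python
# Rules of Example 1 as a partial function Q x Q -> Q.
DELTA = {("q0", "a"): "a", ("a", "a"): "a", ("q0", "b"): "b", ("b", "b"): "b"}
FINAL = {"a", "b", "q0"}

def evaluate(tree: str, q0: str, delta: dict):
    """Return the state reached after consuming the root label, or None if the run gets stuck."""
    def label(i: int):
        # i points just past an opening bracket; returns (state, index past the closing bracket).
        state = q0
        while tree[i] != ">":
            if tree[i] == "<":
                # A child sub-tree: its resulting state becomes an input symbol of this label.
                symbol, i = label(i + 1)
                if symbol is None:
                    return None, i
            else:
                symbol, i = tree[i], i + 1
            state = delta.get((state, symbol))
            if state is None:
                return None, i       # no applicable rule: the run is unsuccessful
        return state, i + 1
    return label(1)[0]

print(evaluate("<<<a>>a<aa>>", "q0", DELTA) in FINAL)  # True: the tree is accepted
print(evaluate("<<a>b>", "q0", DELTA))                 # None: a rule for (a, b) is missing
```

The second call shows the effect described above: the child folds into the symbol `a`, the root label becomes a mixed leaf, and the run stops for lack of a rule with left-hand side (a, b).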

5.1 Generalised automata 

Just as deterministic automata are a special case of the more general non-deterministic automata,
the latter can reasonably be generalised even further. In an NFSTA, non-determinism is only present
in step (b), the horizontal move. Therefore, it seems natural to extend non-determinism to the two
other steps as well:

• Initial state assignment (a) can be generalised by allowing a set of possible initial states $Q_0$,
rather than a single initial state $q_0$.

• Vertical move (c) can be governed by a set of rules $\gamma$, which (non-deterministically) map one
state to another during the move.

It happens that such generalisation does not increase expressive power: all three kinds of string 
tree automata recognise the same class of languages, as will be shown in Section 5.4. Because 
generalised automata are somewhat cumbersome to deal with, and because they are not as useful
as their more restrictive counterparts are, in our further study we shall be primarily dealing 
with "simple" FSTAs. However, the concept of a generalised automaton will be indispensable for 
proving the equivalence of automata and tree regular grammars in Section 6.1. 

Definition 4. A Generalised Non-deterministic Finite String Tree Automaton (GNFSTA) over
an alphabet $\Sigma$ is a tuple $A = (\Sigma, Q, Q_f, Q_0, \Delta, \gamma)$, where

• $Q \supseteq \Sigma$ is a finite set, called set of states, or state set;

• $Q_f \subseteq Q$ is a set of final states;

• $Q_0 \subseteq Q$ is a set of initial states;

• $\Delta$ is a set of "horizontal" transition rules of the form $(q_1, q_2) \to q_3$, where $q_1, q_2, q_3 \in Q$;

• $\gamma$ is a set of "vertical" transition rules of the form $q_1 \to q_2$, where $q_1, q_2 \in Q$.

It is often more convenient to treat $\gamma$ as a function $\gamma : Q \to 2^Q$, so that $\gamma(q)$ denotes the set of
states $q$ can be transformed to during a vertical move. This is the notation that will be used in
the rest of the paper.

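Definition 5. Let $A = (\Sigma, Q, Q_f, Q_0, \Delta, \gamma)$ be a GNFSTA; as usual, we assume that $/ \notin Q$. Let
also $a, b, c \in Q$; $r, l \in (Q \cup \{\langle, \rangle, /\})^*$; and $s \in Q^*$. The move relation $\vdash_A$ between two trees from
$T(Q \cup \{/\})$ is defined as follows:

$$
\begin{array}{llll}
\text{(a)} & l\langle s\rangle r \;\vdash_A\; l\langle a/s\rangle r & \text{if } a \in Q_0 & \text{(initial state assignment)} \\
\text{(b)} & l\langle a/bs\rangle r \;\vdash_A\; l\langle c/s\rangle r & \text{if } (a, b) \to c \in \Delta & \text{(horizontal move)} \\
\text{(c)} & l\langle a/\rangle r \;\vdash_A\; l\,b\,r & \text{if } b \in \gamma(a) & \text{(vertical move)}
\end{array}
$$

As we can see from this definition, an empty tree is accepted if and only if $Q_0 \cap Q_f \neq \emptyset$.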
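
Acceptance by a generalised automaton can be decided without enumerating individual runs, by tracking for every label the set of states in which some run may finish it. The sketch below is an illustration under stated assumptions: the choices made in different sub-trees are treated as independent, as Definition 5 permits; ASCII `<` and `>` stand for the brackets; and the sample automaton (states `A` and `B`, auxiliary symbol `k`) is hypothetical and not taken from the paper.

```python
def reachable(tree: str, initial: set, delta: set, gamma: dict) -> set:
    """Set of states in which some run can finish consuming the root label."""
    def label(i: int):
        states = set(initial)                           # step (a): any state from Q_0
        while tree[i] != ">":
            if tree[i] == "<":
                child_states, i = label(i + 1)
                # Step (c): a finished child in state q may turn into any symbol of gamma(q).
                symbols = set().union(*(gamma.get(q, set()) for q in child_states))
            else:
                symbols, i = {tree[i]}, i + 1
            # Step (b): follow every horizontal rule (a, b) -> c that applies.
            states = {c for (a, b, c) in delta if a in states and b in symbols}
        return states, i + 1
    return label(1)[0]

def accepts(tree, initial, final, delta, gamma) -> bool:
    return bool(reachable(tree, initial, delta, gamma) & final)

# Hypothetical GNFSTA over {a, b}: each label must use a single letter,
# but different labels may use different letters.
Q0 = {"A", "B"}
QF = {"A", "B"}
DELTA = {("A", "a", "A"), ("B", "b", "B"), ("A", "k", "A"), ("B", "k", "B")}
GAMMA = {"A": {"k"}, "B": {"k"}}

print(accepts("<a<bb>a>", Q0, QF, DELTA, GAMMA))  # True
print(accepts("<ab>", Q0, QF, DELTA, GAMMA))      # False
print(accepts("<>", Q0, QF, DELTA, GAMMA))        # True, since Q0 and QF intersect
```

The last call matches the remark above about the empty tree: it is accepted exactly when the sets of initial and final states have a common element.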
5.2 Locality

As follows from the informal description of the operation of a bottom-up tree automaton, the 
actions done in one branch of a tree are independent from the actions performed in another 
branch. Intuitively, the order in which individual leaves are processed should not matter until 
their branches are folded up to a common ancestor. 

By definition, an automaton's run is sequential; there is no parallelism allowed. Thus, even a 
deterministic string tree automaton can produce different runs on the same tree. At each point 
there may be a choice of multiple labels the next step can be applied to. However, this fact does 
not really qualify as non-determinism, because this choice does not affect the success of the run. 

Proposition 1. Consider a non-final tree in a successful GNFSTA run. Then for any of its leaves
a step can be applied to that leaf that belongs to a (possibly different) successful run.

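Proof. Let $A = (\Sigma, Q, Q_f, Q_0, \Delta, \gamma)$ be a GNFSTA that accepts a tree $t_0$. Let $t \in T(Q \cup \{/\})$ be a
non-final tree in the run of $A$ on $t_0$: $t_0 \vdash_A^* t \vdash_A^n \langle q_f/\rangle$, where $n \geq 1$, $q_f \in Q_f$. Now, if $t$ consists of
only one label, the statement is trivial. If $t$ contains more than one label, it can be represented
as $t = \langle x\langle w\rangle y\rangle$, where $w \in (Q \cup \{/\})^*$ is one of its leaves. We want to prove that for any such
representation, $t \vdash_A \langle xvy\rangle \vdash_A^{n-1} \langle q_f/\rangle$ for some $v$ such that:

$$v = \langle w'\rangle, \text{ where } w' \in (Q \cup \{/\})^*, \quad \text{or} \quad v = q, \text{ where } q \in Q.$$

This is proven by induction on $n$.

Basis $n = 3$. The smallest number of steps for a tree consisting of more than one label is three:

$$\langle\langle q'/\rangle\rangle \vdash_A \langle q\rangle \vdash_A \langle q_0/q\rangle \vdash_A \langle q_f/\rangle.$$

In this case, $v = q$ and the statement is true.

Induction. Suppose the statement is true for $n$ steps; we need to prove it for $n + 1$ steps. Since
$t \vdash_A^{n+1} \langle q_f/\rangle$, there exists $t'$ such that $t \vdash_A t' \vdash_A^n \langle q_f/\rangle$. Consider all possible steps that can be
applied to $t$. By the move relation definition, the step from $t$ to $t'$ can be done on the label that
is either $\langle w\rangle$, or wholly contained within $x$ or $y$. This gives us four possibilities for this step (and
thus for $t'$):

$$
\begin{array}{ll}
\langle x\langle w\rangle y\rangle \vdash_A \langle x'\langle w\rangle y\rangle, & \langle x\langle w\rangle y\rangle \vdash_A \langle x\langle w'\rangle y\rangle, \\
\langle x\langle w\rangle y\rangle \vdash_A \langle x\langle w\rangle y'\rangle, & \langle x\langle w\rangle y\rangle \vdash_A \langle xqy\rangle.
\end{array}
$$

If $t'$ is one of those listed in the right column, then the induction holds immediately. Otherwise,
we assume that $t' = \langle x'\langle w\rangle y\rangle$ (the other case is fully analogous).

Applying the inductive hypothesis to $t'$, we get $t' \vdash_A \langle x'vy\rangle \vdash_A^{n-1} \langle q_f/\rangle$. The move from $t'$ to
$\langle x'vy\rangle$ is done via one of the three GNFSTA steps with $l = \langle x'$ and $r = y\rangle$. Let $l = \langle x$; then the
same step will apply to $t$:

$$t = \langle x\langle w\rangle y\rangle \vdash_A \langle xvy\rangle.$$

Now, by analogy, consider the step from $t$ to $t'$. During that step, $l$ is a prefix of $\langle x$ and $r = z\langle w\rangle y\rangle$,
where $z$ is an arbitrary string from $(Q \cup \{\langle, \rangle, /\})^*$. Let $r = zvy\rangle$; then the same step will apply to
the tree $\langle xvy\rangle$: $\langle xvy\rangle \vdash_A \langle x'vy\rangle$. Thus, we get:

$$t \vdash_A \langle xvy\rangle \vdash_A \langle x'vy\rangle \vdash_A^{n-1} \langle q_f/\rangle,$$

which proves the inductive statement. □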
5.3 Automata with "pure" states



A salient difference between string tree automata and the traditional finite state machines is that
an FSTA's state set is a superset of the "input" alphabet Σ. In other words, in FSTAs all input
symbols can serve as what corresponds to states in traditional automata: for example, they can
occur in right hand sides of rules from Δ.

This came out of the fact that in string tree automata, "traditional" states become input 
symbols during vertical moves. Therefore, there is not much conceptual difference between them; 
that's why we often call states state symbols. An FSTA has to accept both sets as its input. Then, 
to make life easier and notation simpler, we allowed our FSTAs to use both sets as states as well. 
As a side effect, this has permitted in certain cases simple and elegant implementations, such as 
the one from Example 1. 

However, there is nothing special about using input symbols as states. In fact, any (GN)FSTA
can be trivially rewritten so as not to use them. This rewrite preserves the kind of the
automaton, i.e. whether it is deterministic, non-deterministic, or generalised.

Indeed, let A = (Σ, Q, Q_f, Q_0, Δ, γ) be an FSTA. Consider a set of "complementary" symbols
Σ̄ such that there is a unique symbol ā ∈ Σ̄ for each a ∈ Σ, and Σ̄ ∩ Q = ∅. Let the new state set
Q' be Q ∪ Σ̄. Let us also extend the complementation operator to Q so that q̄ = q if q ∈ Q \ Σ.
Then, for each rule from Δ we add one or two rules to (initially empty) Δ' as follows:



(Note that when r ∉ Σ, the set under the union sign is effectively a single-element set.) For a
GNFSTA, we also build the new automaton's γ' as



This procedure simply creates a duplicate set of input symbols, which can be used as states, and 
then replaces input symbols with their duplicates in every place where they were used as states. 
Note the following properties of the automaton A' = (Σ, Q', Q_f, Q_0, Δ', γ'):

• no rule (horizontal or vertical) can produce a symbol from Σ;

• no rule takes a symbol from Σ as its state symbol;

• neither the starting nor the final state sets contain symbols from Σ.

Such automata will be called automata with pure states. 

Simple substitution by FSTA definition reveals that L(A') = L(A). Also, nothing in this 
procedure changes the deterministic properties of the automaton. 

5.4 Equivalence of deterministic and non-deterministic automata 

Now, we are ready to show that deterministic, non-deterministic, and generalised automata recog- 
nise the same class of languages. For this, it is sufficient to show that for any GNFSTA there 
exists an equivalent (i.e., recognising the same language) DFSTA. Moreover, as follows from the 
previous section, GNFSTAs can be assumed to be automata with pure states. The overview of 
the proof is presented in Figure 3. 

Theorem 1. For any GNFSTA with pure states over Σ there exists an equivalent DFSTA over



This statement can be proven using the same technique as in the classical (string) automata 
theory. We only give an informal sketch of the proof here, referring the reader to Appendix A for 
the complete proof. 




If' (9) = 70?) for all q e Q; 
i{a) = for all a e E. 
Figure 3: Equivalence of different kinds of string tree automata



Let A be a GNFSTA (Σ, Q, Q_f, Q_0, Δ, γ) with pure states. We construct DFSTA A' =
(Σ', Q', Q'_f, q'_0, Δ'), using the set of all subsets of Q as its state set. All possible outcomes of
rules from Δ with identical left hand side will then be lumped together into one set; this set
(which is a single state in the new automaton) will be assigned to the corresponding deterministic
rule in Δ'.

Firstly, however, one must remember that, according to the FSTA definition, the state set must
contain all the symbols from the alphabet. We could satisfy this by letting Q' = 2^Q ∪ Σ, but this
would complicate the proof unnecessarily. Instead, we let the new alphabet be the set of all
single-element sets containing symbols from the original alphabet: Σ' = {{a} | a ∈ Σ} = ⋃_{a∈Σ} {{a}}.
Then we note that the natural bijection between Σ and Σ' implies a one-to-one correspondence
between T(Σ) and T(Σ'), so the two automata A and A' operate in fact on the same trees (with
renamed alphabet symbols).

Now, let Q' = 2^Q; Q'_f = {q' ∈ Q' | q' ∩ Q_f ≠ ∅}; q'_0 = Q_0. Because A' does not have vertical
rules, their functionality has to be incorporated in horizontal rules. To see how this can be done,
consider the following example that shows excerpts from a GNFSTA run on a tree fragment:

t_0 = ⟨ab⟨def⟩c⟩   ⊢*   t_1 = ⟨ab⟨q_1/⟩c⟩   ⊢*   t_2 = ⟨q_02/abq_2c⟩

Here, q_02 is one of the starting states, and q_2 ∈ γ(q_1). Imagine an arbitrary leaf label in a GNFSTA
run on some tree (looking at t_2 as an example). Suppose it is ready for a horizontal move (i.e.,
contains /). What symbols can it be composed of?

• The symbol on the left hand side of the / can be either an initial state, or a result of some
horizontal rule.

• The symbols on the right hand side of the / can be either original input symbols from Σ (such
as abc), or results of some vertical rule (such as q_2, which was produced by the rule q_1 → q_2
from γ).

If vertical rules were not allowed, then q_1 would have squeezed into t_2 in the place of q_2. Thus, new
horizontal rules from Δ' have to apply γ to all right hand side symbols that result from vertical
moves, prior to looking up an appropriate horizontal rule from Δ. Roughly speaking, Δ'(p, q) =
Δ(p, γ(q)), if q came via a vertical move from a subordinate node; and Δ'(p, q) = Δ(p, q), if q is
an input symbol that originally belonged to the node being processed.
Fortunately, distinguishing between these two cases is very easy, because our GNFSTA is an
automaton with pure states. Indeed, in such an automaton γ is not allowed to produce symbols
from Σ, which is where all the original input symbols come from. Thus, if a symbol in a label
belongs to Σ, it must be an original one; otherwise, the symbol has to be a result of a vertical rule.
Appendix A gives a full proof that the automaton A' built as described above is indeed equivalent
to A.
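
The rough rule for the new horizontal rules can be illustrated by a single step of the subset automaton. The Python sketch below is only an illustration under stated assumptions, not the construction proven in Appendix A: the dictionary encodings of the horizontal and vertical rule sets, the function name det_step, and the toy states p1, p2, p3 are all hypothetical.

```python
def det_step(current, symbol, delta, gamma, sigma):
    """One horizontal step of the subset automaton (illustrative sketch).

    current: frozenset of states of the original GNFSTA.
    symbol:  frozenset; either {a} for an input symbol a in sigma, or a set of
             pure states delivered by a subordinate label.
    delta:   dict mapping (state, symbol) to a set of states (horizontal rules).
    gamma:   dict mapping a state to a set of states (vertical rules).
    """
    if len(symbol) == 1 and next(iter(symbol)) in sigma:
        # Original input symbol: with pure states it cannot be a gamma result.
        seen = set(symbol)
    else:
        # Symbol arrived from a child label: apply the vertical rules first.
        seen = set()
        for q in symbol:
            seen |= gamma.get(q, set())
    result = set()
    for p in current:
        for r in seen:
            result |= delta.get((p, r), set())
    return frozenset(result)


# Toy rules (assumed for illustration); q1 -> q2 plays the role of a vertical rule.
sigma = {"a", "b", "c"}
gamma = {"q1": {"q2"}}
delta = {("q0", "a"): {"p1"}, ("p1", "q2"): {"p2", "p3"}}

after_a = det_step(frozenset({"q0"}), frozenset({"a"}), delta, gamma, sigma)
after_child = det_step(after_a, frozenset({"q1"}), delta, gamma, sigma)
print(sorted(after_a), sorted(after_child))
```

The membership test against sigma is where the pure-state assumption is used: a symbol from the input alphabet is looked up directly, while anything else is first passed through the vertical rules.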

5.5 Finite string tree automata and classical automata 

Due to their historical background, FSTAs share much in common with classical string and tree 
automata. In a sense, FSTAs can be thought of as a generalisation of classical automata: an 
FSTA can be applied wherever a string or tree automaton is used. The aim of this section is to 
discuss the relationships between different kinds of automata. 

5.5.1 String automata 

Finite automata (FA), also called finite state machines (FSM), are widely used in automata and 
formal language theories. In this paper, we often call them "finite string automata" to explicitly 
distinguish them from tree automata. A finite automaton is a tuple (Σ, Q, Q_f, q_0, δ), where Σ is
an alphabet, Q is a finite state set, Q_f ⊆ Q is a set of final states, q_0 ∈ Q is the initial state, and
δ is a map from Q × Σ to 2^Q. A finite automaton works on strings from Σ*, starting in the state
q_0 and applying rules from δ to its current state and the next symbol from the input string to
determine its next state. The language recognised by an automaton is the set of strings that it
accepts.
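
As a minimal sketch of this definition, the following Python function simulates a finite automaton whose transition map returns sets of states. The function name fa_accepts, the dictionary encoding of the transition map, and the example automaton (strings over {a, b} ending in "ab") are assumptions made for illustration only.

```python
def fa_accepts(delta, initial, finals, word):
    """Simulate a finite automaton whose delta maps (state, symbol) to a set of states."""
    current = {initial}
    for symbol in word:
        # Follow every applicable rule from every state reached so far.
        current = {p for q in current for p in delta.get((q, symbol), set())}
    return bool(current & finals)


# Hypothetical automaton accepting strings over {a, b} that end in "ab".
delta = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
print(fa_accepts(delta, "q0", {"q2"}, "aab"))
print(fa_accepts(delta, "q0", {"q2"}, "aba"))
```

The first call prints True and the second prints False.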

The move of an FA is, therefore, defined exactly like the combination of steps (a) and (b) of 
the FSTA move (see Definition 3). This leads us to the following statement. 

Proposition 2. If L is the language recognised by a finite automaton (Σ, Q, Q_f, q_0, δ), then ⟨L⟩
is recognised by the FSTA (Σ, Q ∪ Σ, Q_f, q_0, δ), where δ is naturally extended so that δ(q_1, q_2) = ∅
for all q_1, q_2 ∈ Q.

As we remember, ⟨L⟩ = {t ∈ T(Σ) | t = ⟨s⟩ for some s ∈ L}. Indeed, for single-label trees
this statement follows immediately from the automata definitions. Trees containing more than
one label are not accepted by this FSTA for the following reason. Suppose t ∈ T(Σ) has more
than one label. Then there exists a pair of labels l_1 and l_2 such that l_1 contains l_2: l_1 = ⟨x l_2 y⟩.
When l_2 has been fully processed, it becomes ⟨q_2/⟩, where q_2 ∈ Q by definition of the FA (because
q_0 ∈ Q and Im δ ⊆ 2^Q). A vertical move then injects q_2 into l_1: ⟨x q_2 y⟩. The automaton will stop
at q_2 because there is no suitable rule in δ.

A number of finite automata can be combined together to form an FSTA that accepts different 
string languages in labels depending on their position in the tree, where position is determined by 
the topology of the tree and by the information in other labels. This will be discussed in more
detail in Section 6.1.

5.5.2 Tree automata 

In this paper, these are usually referred to as "classical tree automata" (CTA). The following 
introduction is largely borrowed from [13]. 

Trees in the classical tree automata theory are called terms and are composed of ranked symbols.
A ranked alphabet is a finite set of symbols, in which each symbol is associated with a
non-negative integer, called arity. The arity of a symbol f is denoted as Arity(f). Symbols of arity
0, 1, 2, …, p are called constants, unary, binary, …, p-ary symbols, respectively.

The set T(ℱ) of terms over the ranked alphabet ℱ is the smallest set defined by:

• f ∈ T(ℱ) for any constant f ∈ ℱ; and

• if f ∈ ℱ, p = Arity(f) ≥ 1, and t_1, …, t_p ∈ T(ℱ), then f(t_1, …, t_p) ∈ T(ℱ).
For example, consider a ranked alphabet ℱ consisting of a binary symbol + and two constants 1
and 2: ℱ = {+(,), 1, 2}. A term +(1, +(1, 2)) represents the following tree:

      +
     / \
    1   +
       / \
      1   2

A term t ∈ T(ℱ) may be viewed as a finite ordered labelled tree, the leaves of which are labelled
with constants and the internal nodes are labelled with symbols of positive arity. The number of
children a node has must be equal to the arity of the node's symbol.

Consequently, a term from T(ℱ) can be represented as a string tree over the un-ranked alphabet
Σ = ℱ. This is illustrated by the following recursively defined injective map τ : T(ℱ) → T(Σ):

• τ(a) = ⟨a⟩ (for a constant symbol a);

• τ(f(t_1, …, t_p)) = ⟨f τ(t_1) ··· τ(t_p)⟩.

For instance, the term +(1,+(1,2)) pictured above would map to ⟨+⟨1⟩⟨+⟨1⟩⟨2⟩⟩⟩.
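
The map from terms to string trees can be sketched in a few lines of Python. This is an illustration only: the encoding of terms as nested tuples, the function name tau, and the use of the ASCII characters < and > for angle brackets are assumptions of the sketch, and constants are assumed to map to single-symbol labels, as in the example above.

```python
def tau(term):
    """Map a ranked term to its string tree, written with < and > as brackets.

    A term is either a constant (a plain string) or a tuple
    (symbol, subterm_1, ..., subterm_p).
    """
    if isinstance(term, str):
        return "<" + term + ">"
    symbol, *subterms = term
    return "<" + symbol + "".join(tau(t) for t in subterms) + ">"


term = ("+", "1", ("+", "1", "2"))   # the term +(1, +(1, 2))
print(tau(term))
```

For the term +(1, +(1, 2)) this prints `<+<1><+<1><2>>>`, the ASCII rendering of the string tree given above.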

A non-deterministic classical finite tree automaton (NCFTA) over ℱ is a tuple A = (Q, ℱ, Q_f, Δ).
Q is an alphabet consisting of constant symbols called states; Q_f ⊆ Q is a set of final states. Δ
is a set of rules of the following type: f(q_1, …, q_n) → q, where n = Arity(f), q, q_1, …, q_n ∈ Q.

Like our string tree automata, a classical bottom-up tree automaton also starts at the leaves
and moves upwards, associating a state with each subterm. If the direct subterms u_1, …, u_n of
term t = f(u_1, …, u_n) are assigned states q_1, …, q_n respectively, then the term t will be assigned
some state q given that f(q_1, …, q_n) → q ∈ Δ. The move relation is thus defined in T(ℱ ∪ Q);
its full formal definition is outside the scope of this paper and can be found in [13].

Tree languages recognised by classical automata happen to be recognisable by string tree 
automata, as demonstrated by the following proposition. 

Proposition 3. Let A be an NCFTA (Q, ℱ, Q_f, Δ), recognising language L ⊆ T(ℱ), and let
Σ be the (non-ranked) set of symbols of the alphabet ℱ. Then there exists an NFSTA A' =
(Σ, Q', Q'_f, q_0, Δ') that recognises the language L' = τ(L).

Proof. The automaton A' is constructed step-by-step, by taking transition rules from Δ and
populating Q' and Δ' as described below.

Let Δ'_0 = ∅, Q'_f = Q_f, and Q'_0 = Q ∪ {q_0}, where q_0 is a new state symbol (q_0 ∉ Q); the
subscripts of Q'_i and Δ'_i index the construction steps. Let us enumerate the rules in Δ, indexing
them from 1 to n.

For each i from 1 to n we do the following: assuming that the i-th rule in Δ is
f_i(q_i1, …, q_ip_i) → q_i, p_i ≥ 0, let

Q'_i = Q'_{i-1} ∪ {q'_i1, …, q'_ip_i},

Δ'_i = Δ'_{i-1} ∪ { (q_0, f_i) → q'_i1,
                    (q'_i1, q_i1) → q'_i2,
                    …,
                    (q'_i,p_i-1, q_i,p_i-1) → q'_ip_i,
                    (q'_ip_i, q_ip_i) → q_i },

where q'_ij (1 ≤ i ≤ n, 1 ≤ j ≤ p_i) are new unique states (q'_ij ∉ Q'_{i-1}).

Finally, let Q' = Q'_n, Δ' = Δ'_n. The language recognised by the resulting NFSTA A' =
(Σ, Q', Q'_f, q_0, Δ') is exactly (up to the mapping τ) the language recognised by the original NCFTA,
which can be proven by applying the relevant definitions. □
5.5.3 Vertical move and classical string automata

As already discussed, the horizontal move of a string tree automaton is an exact copy of the step 
of a traditional string automaton. Let us now investigate the vertical move from this point of 
view. 

Consider string abc and tree ⟨⟨⟨⟨⟩a⟩b⟩c⟩. The tree is arranged so that one vertical move is
required for an FSTA to consume each symbol, bringing out the analogy between an FSTA's vertical
movement and a step of a string automaton. Note the seemingly excessive extra pair of angle
brackets in the middle; they actually simplify things, as will be shown below. Despite the fact
that the actual moves are quite different (for example, an FSTA has to start with its initial state
at each symbol, while a conventional FA only starts with q_0 at the beginning of a string), any FA
can be implemented as an FSTA working with "vertical" tree representations of strings.

Let Σ be an alphabet. We shall call a tree t ∈ T(Σ) a vertical tree representation of a string
s ∈ Σ*, if

s = a_1···a_n (a_1, …, a_n ∈ Σ, n ≥ 0) and t = ⟨⟨···⟨⟨⟩a_1⟩a_2⟩···a_n⟩,

where t opens with n+1 angle brackets.

The vertical tree representation can be obtained as the image of the function ω : Σ* → T(Σ),
recursively defined as

• ω(ε) = ⟨⟩ and

• ω(sa) = ⟨ω(s)⟩ · ⟨a⟩ (where a ∈ Σ, s ∈ Σ*).
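
The recursive definition can be unrolled into a simple loop. The sketch below is an illustration only; the function name omega and the use of the ASCII characters < and > for angle brackets are assumptions, and each step encapsulates the tree built so far and then appends the next symbol to the new root label.

```python
def omega(s):
    """Vertical tree representation of string s, with < and > as angle brackets."""
    tree = "<>"                      # representation of the empty string
    for a in s:
        tree = "<" + tree + a + ">"  # encapsulate, then append the next symbol
    return tree


print(omega("ab"))
```

For the string ab this prints `<<<>a>b>`, and a string of length n always yields a tree that opens with n + 1 brackets.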

Proposition 4. For any FA over Σ there exists an FSTA over Σ that recognises the vertical tree
representation of the FA's language.

Sketch of proof. Let FA = (Σ, Q, Q_f, q_0, δ) be a finite string automaton, recognising language
L(FA) ⊆ Σ*. We want to construct a finite string tree automaton A that recognises the language
ω(L(FA)) ⊆ T(Σ).

To build A, we need an additional set of complementary states Q̄ such that there is a unique
state q̄ ∈ Q̄ for each q ∈ Q, where q̄ ∉ Q, q̄ ∉ Σ. We also need a new unique initial state q'_0. Each
step of FA will map to four steps of A: horizontal, vertical, initial state assignment, and horizontal
again. The first horizontal move will each time produce a complementary state; the vertical move will
deliver it one level up; and the following two steps will convert this complementary state to its
counterpart, preparing for the next horizontal move. This is illustrated below by a run of some
FA on the string ab and a corresponding run of an FSTA on ω(ab) = ⟨⟨⟨⟩a⟩b⟩:

(q_0, ab) ⊢ (q_1, b) ⊢ (q_2, ε)

⟨⟨⟨⟩a⟩b⟩ ⊢* ⟨⟨q_0/a⟩b⟩ ⊢ ⟨⟨q̄_1/⟩b⟩ ⊢ ⟨q̄_1 b⟩ ⊢ ⟨q'_0/q̄_1 b⟩ ⊢ ⟨q_1/b⟩ ⊢ ⟨q̄_2/⟩

Let A = (Σ, Q', Q'_f, q'_0, Δ), where

Q' = Q ∪ Q̄ ∪ {q'_0} ∪ Σ,

Q'_f = Q̄_f             if q_0 ∉ Q_f,
Q'_f = Q̄_f ∪ {q'_0}    if q_0 ∈ Q_f,

and the rule set Δ is built as follows:

• for each rule (q, a) → p from δ we add (q, a) → p̄ to Δ;

• for each state q ∈ Q we add (q'_0, q̄) → q to Δ;

• Δ also contains the rule (q'_0, q'_0) → q_0:

Δ = ⋃_{(q,a)→p ∈ δ} {(q, a) → p̄}  ∪  ⋃_{q ∈ Q} {(q'_0, q̄) → q}  ∪  {(q'_0, q'_0) → q_0}.
Let L(A) be the language recognised by A. To prove the equivalence of the two automata, we
must show that

(i) s ∈ L(FA) implies that ω(s) ∈ L(A); and

(ii) for any t ∈ L(A) there exists s ∈ L(FA) such that ω(s) = t.

Statement (i) can be proven by considering the run of FA on s and building the corresponding
run on ω(s), replacing each original step with four new ones as illustrated above. The result is then
shown to be a valid successful run of A.

To prove statement (ii), consider the sequence of tree sets T_n, composed of trees that produce
a successful run in n steps: T_n = {t ∈ T(Q' ∪ {/}) | t ⊢^n ⟨q'_f/⟩, q'_f ∈ Q'_f}. Induction on n with
4-step increments shows that all trees from L(A) are contained in T_{4k+1} (k = 0, 1, …) and that
each tree from L(A) ∩ T_{4k+1} can be represented as ω(s), where s ∈ L(FA). □

5.5.4 Boolean closure 

According to the formal language theory, the class of recognisable (regular) string languages is 
closed under union, under intersection, and under complementation [14]. The same applies to 
the classical tree languages [13]. Quite naturally, string tree languages are no exception and also 
exhibit the same Boolean closure properties. The proof from classical literature can be easily 
transferred to our case. Thus, only the main idea of the proof is presented here. 

To prove that languages L_1 ∪ L_2 and L_1 ∩ L_2 are recognisable, we take the FSTAs that recognise
L_1 and L_2 and unite their state and rule sets. The final state sets for the new automata are taken
as the union and intersection, respectively, of the original final state sets.

To prove that T(Σ) \ L is recognisable, we consider a complete FSTA that recognises L.
Any FSTA can be made complete, as shown in Corollary 1 in Appendix A. Complementing its
final state set produces the automaton that recognises T(Σ) \ L.

5.6 Discussion 

We have compared regular string trees with regular strings and classical trees. The most important 
conclusion is that the former can "grow" in two directions: horizontally and vertically, combining 
the properties of the latter two. Classical trees, for instance, can regularly grow downwards, but 
the degree of each node is fixed and depends on the node's symbol (label). In string trees, a node 
can have a regular set of child trees. 

The noted similarities between traditional regular sets and horizontal and vertical arrangements
of nodes in regular string trees hint that the usual regular properties are likely to hold. Pumping, for
example, applies to tree growth both in width and in depth.

6 Regular languages, grammars, and expressions 

In this section, we discuss an alternative approach to the definition of recognisable tree languages, 
based on the concept of grammars. Again, this goes in parallel with, and derives from, the classical 
theories. 

6.1 Regular tree grammars 

In contrast with an automaton, which is an accepting device, a grammar is a generating device. A
grammar defines a set of rules, which generate objects (strings or trees) from a pre-defined starting
point. The language defined by a grammar is the set of all objects that can be generated using
the rules of the grammar.

A string tree grammar is very similar to an extended context-free grammar (an extended CFG 
is a CFG which allows regular expressions, rather than simple sequences, in the right hand sides 
of its rules; ECFGs have the same expressive power as normal CFGs). Before proceeding with the 
tree grammar definition, we need to briefly introduce "string-regular" expressions, which we shall 
be using extensively. 

A regular expression (RE) [14] is a mechanism of formal language theory that describes regular
languages. A regular expression over Σ defines a (regular) subset of Σ*, using symbols from Σ
and three regular operators. The set RE(Σ) of regular expressions over Σ is defined as follows:

• ε ∈ RE(Σ);

• Σ ⊆ RE(Σ);

• if r_1, r_2 ∈ RE(Σ), then r_1r_2 ∈ RE(Σ) (concatenation);

• if r_1, r_2 ∈ RE(Σ), then r_1|r_2 ∈ RE(Σ) (union);

• if r ∈ RE(Σ), then r* ∈ RE(Σ) (iteration, or Kleene star).

The language defined by a RE r is denoted as [r].

Definition 6. A regular string tree grammar (RSTG) is a tuple (Σ, N, S, R), where:

• Σ is a finite alphabet,

• N is a finite set of non-terminal symbols (N ∩ Σ = ∅),

• S ∈ N is an axiom, or starting non-terminal, and

• R is a set of production rules of the form n → ⟨r⟩, where n ∈ N, r ∈ RE(Σ ∪ N).

The only difference between RSTGs and extended context-free grammars is the pair of angle
brackets in the right hand sides of production rules. They are there to indicate that every
application of a production rule inserts a pair of angle brackets into the generated string.
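
To make the role of the brackets concrete, the following Python sketch checks whether a small tree can be generated by such a grammar. It is a brute-force illustration under several assumptions that are not part of the paper: trees are nested lists, every terminal and non-terminal is a single alphanumeric character, right hand sides are written in the syntax of Python's re module, and the function name generators and the toy grammar are hypothetical.

```python
import re
from itertools import product


def generators(tree, rules):
    """Return the set of non-terminals that can generate `tree`.

    tree:  nested list; items are one-character symbols or sub-lists (labels).
    rules: dict mapping a non-terminal n to a Python regex r, read as n -> <r>.
    """
    options = []
    for item in tree:
        if isinstance(item, list):
            # A subtree may stand wherever one of its generators may occur.
            options.append(sorted(generators(item, rules)))
        else:
            options.append([item])
    result = set()
    for word in product(*options):
        for nonterminal, regex in rules.items():
            if re.fullmatch(regex, "".join(word)):
                result.add(nonterminal)
    return result


# Hypothetical grammar: S -> <a B* c>, B -> <b*>
rules = {"S": "aB*c", "B": "b*"}
print("S" in generators(["a", ["b", "b"], [], "c"], rules))
print("S" in generators(["a", ["c"], "c"], rules))
```

The first call prints True and the second prints False: each nested list corresponds to one application of a production rule, so a subtree is acceptable exactly where one of the non-terminals generating it is allowed by the enclosing rule.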

The derivation relation ⇒_G, associated to a regular tree grammar G = (Σ, N, S, R), is defined
on pairs of strings from (Σ ∪ N ∪ {⟨, ⟩})*:

u ⇒_G v  if and only if  u = lXr,  X → ⟨e⟩ ∈ R,  v = l⟨x⟩r,  and  x ∈ [e],

where u, v, l, r ∈ (Σ ∪ N ∪ {⟨, ⟩})*, X ∈ N, x ∈ (Σ ∪ N)*, e ∈ RE(Σ ∪ N).

Theorem 2. A string tree language is recognisable (by a finite string tree automaton) if and only
if it is generated by a regular tree grammar.

Proof. Grammar → automaton: Given some regular tree grammar G = (Σ, N, S, R), we show
how to build a generalised finite tree automaton with pure states A = (Σ, Q, Q_f, Q_0, Δ, γ) which
recognises L(G), the language generated by G.

Let k = |R| be the number of production rules in R. Each rule r_i ∈ R has the form n_i → ⟨e_i⟩,
where n_i is a non-terminal from N and e_i is a regular expression over Σ ∪ N. According to
the automata theory, for each regular expression there exists a finite (string) automaton that
recognises the same language. Let FA_i be a finite automaton equivalent to e_i. Consider FA_i =
(Σ ∪ N, Q_i, q_0i, Q_fi, δ_i) for i = 1, …, k, where:

• Q_i is a finite set of states, disjoint with Σ ∪ N;

• q_0i is the initial state, q_0i ∈ Q_i;

• Q_fi is the set of final states, Q_fi ⊆ Q_i;

• δ_i is a map from Q_i × (Σ ∪ N) to 2^{Q_i}; or, equivalently, a set of rules of the form (q, a) → p,
where q, p ∈ Q_i, a ∈ Σ ∪ N.

Since all these automata are independent (apart from having a common alphabet Σ ∪ N), we can
assume that their state sets Q_1, …, Q_k are pairwise disjoint.

We now build our GNFSTA A = (Σ, Q, Q_f, Q_0, Δ, γ) as follows:

Q   = Σ ∪ N ∪ Q_1 ∪ ··· ∪ Q_k
Q_0 = {q_01, …, q_0k}
Δ   = δ_1 ∪ ··· ∪ δ_k
γ   = ⋃_{i=1,…,k} ⋃_{q ∈ Q_fi} {q → n_i}
Q_f = γ^{-1}({S}) = ⋃_{i: S → ⟨e_i⟩ ∈ R} Q_fi

To prove the equivalence of G and A, we show for any tree t ∈ T(Σ ∪ N) that it is generated by
G if and only if it is accepted by A. The induction is done on the number of labels in t.

Basis n = 1. Then t = ⟨s⟩, where s ∈ (Σ ∪ N)*.

⟹: If t is generated by G, then there is a rule S → ⟨e_i⟩ in R such that s ∈ [e_i]. By definition
of A, s is accepted by FA_i = (Σ ∪ N, Q_i, q_0i, Q_fi, δ_i), where Q_i ⊆ Q, q_0i ∈ Q_0, Q_fi ⊆ Q_f, δ_i ⊆ Δ.
Therefore, ⟨s⟩ is accepted by A as follows:

⟨s⟩ ⊢ ⟨q_0i/s⟩ ⊢* ⟨q_f/⟩

for some q_f ∈ Q_fi ⊆ Q_f.

⟸: If ⟨s⟩ is accepted by A, then there is a run

⟨s⟩ ⊢ ⟨q_0/s⟩ ⊢ ⟨q_1/s_1⟩ ⊢ ··· ⊢ ⟨q_f/⟩,

where q_0 ∈ Q_0; s = a_1s_1, s_1 = a_2s_2, …; (q_{j-1}, a_j) → q_j ∈ Δ; and q_f ∈ Q_f. By definition of
Q_0, q_0 = q_0i ∈ Q_i for some i. Because Q_1, …, Q_k are pairwise disjoint (and by definition of Δ),
the rule (q_0, a_1) → q_1 belongs to δ_i, which implies that q_1 ∈ Q_i. The same reasoning can then
be applied to q_1, q_2, …, showing that all the horizontal rules applied belong to the same δ_i and
q_f ∈ Q_fi. Thus, s is accepted by the finite string automaton FA_i. (If s is empty, then q_0 ∈ Q_f,
therefore q_0 ∈ Q_fi for some i, so s is accepted by FA_i.)

Notice that q_f belongs to both Q_fi and Q_f, which by definition of Q_f implies that the rule
S → ⟨e_i⟩ belongs to R, where e_i is the regular expression equivalent to FA_i. Thus, the grammar
G generates ⟨[e_i]⟩, and in particular, ⟨s⟩.

Induction. By inductive hypothesis, we assume that any tree t' ∈ T(Σ ∪ N) that has m labels
(m ≥ 1) is generated by G if and only if it is accepted by A. We want to show that the same holds
true for any tree t with m + 1 labels. Let us select a leaf label in t: t = l⟨s⟩r, where s ∈ (Σ ∪ N)*.

⟹: If t is produced by G, then there is a tree t' = lnr also produced by G such that n ∈ N and
there is a rule r_i: n → ⟨e⟩ in R such that s ∈ [e]. Thus, by definition of A,

l⟨s⟩r ⊢ l⟨q_0i/s⟩r ⊢* l⟨q_f/⟩r

for some q_f ∈ Q_fi. Then we notice that γ contains the rule q_f → n, which implies l⟨q_f/⟩r ⊢ lnr.
By induction, lnr = t' is accepted by A, therefore t is accepted as well.

⟸: If t = l⟨s⟩r is accepted by A, then by Proposition 1 (locality) there is a successful run of A
on t:

l⟨s⟩r ⊢ l⟨q_0/s⟩r ⊢ l⟨q_1/s_1⟩r ⊢ ··· ⊢ l⟨q_z/⟩r ⊢ lq'r ⊢* ⟨q_f/⟩,

where q_0 ∈ Q_0; s = a_1s_1, s_1 = a_2s_2, …; (q_{j-1}, a_j) → q_j ∈ Δ; q_z → q' ∈ γ; and q_f ∈ Q_f. By
analogy with the inductive basis, we can see that q_0 = q_0i for some i; all rules (q_{j-1}, a_j) → q_j
belong to δ_i for the same i; q_z ∈ Q_fi; and q' = n_i (the non-terminal in the left hand side of the
i-th rule in R). Thus, s is accepted by FA_i, equivalent to the regular expression e_i in the rule
n_i → ⟨e_i⟩ ∈ R.

Consider the tree l n_i r. It is accepted by A (because l n_i r = l q' r and l q' r is accepted), therefore
by the inductive hypothesis l n_i r is generated by G. Then, all trees l⟨[e_i]⟩r are also generated by
G; and this includes t = l⟨s⟩r, because s ∈ [e_i].

Automaton → grammar: Given a finite tree automaton with pure states A = (Σ, Q, Q_f, q_0, Δ),
we want to build a regular tree grammar G = (Σ, N, S, R) that generates L(A).

Let N = (Q \ Σ) ∪ {S}, where S is a symbol not in Q or Σ. For each n_i ∈ N \ {S}, let us
build an automaton FA_i = (Q, Q \ Σ, {n_i}, q_0, Δ), whose input symbols are all the states of A,
and whose states are the "pure states" (Q \ Σ) of A. The transition rules and the initial state for
each FA_i are taken directly from A. Let also FA_S = (Q, Q \ Σ, Q_f, q_0, Δ). All these automata
only differ in their final state sets.

According to the automata theory, for each FA_i there exists an equivalent regular expression
e_i over Q. Since Q ⊆ Σ ∪ N, each e_i is a valid regular expression over Σ ∪ N as well. The
same applies to e_S, equivalent to FA_S.

Now, let R = ⋃_i {n_i → ⟨e_i⟩} ∪ {S → ⟨e_S⟩}. The resulting grammar G = (Σ, N, S, R) is a
regular string tree grammar, which generates the language of automaton A. □

6.2 Vertical concatenation 

Although the two tree operators, concatenation and encapsulation, are sufficient to build all pos- 
sible string trees out of basic elements, it may occasionally be useful to link trees at places other 
than their roots. For example, a tree can be built from root downwards by repeatedly attaching 
other trees at its leaves. In tree automata theory this type of tree linking is done by replacing a 
symbol in the first tree by the root of the second tree (Figure 4). This is readily transferable to 
string trees. 




Figure 4: Concatenation through variable x 



Definition 7. Vertical concatenation of two trees u, v ∈ T(Σ) through symbol x ∈ Σ, denoted
u ·_x v, is the tree derived from u by replacing all occurrences of x in it with v.

Vertical concatenation can also be defined recursively as follows:

⟨⟩ ·_x t = ⟨⟩
⟨a⟩ ·_x t = ⟨a⟩, for a ≠ x
⟨x⟩ ·_x t = ⟨t⟩
(u · v) ·_x t = (u ·_x t) · (v ·_x t)
⟨u⟩ ·_x t = ⟨u ·_x t⟩

where a, x ∈ Σ; u, v, t ∈ T(Σ).

Note that vertical concatenations have no neutral element, because they always insert at least a
pair of angle brackets: ⟨axb⟩ ·_x ⟨x⟩ = ⟨a⟨x⟩b⟩. By analogy with tree automata theory, symbols
which are used for vertical concatenation will be called variables to help us distinguish them from
other symbols in the alphabet. Normally, symbols which can occur in the actual trees being
processed are not used as variables; the latter are chosen from some disjoint set.
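
The recursive definition translates almost literally into code. In the sketch below, which is only an illustration, trees are assumed to be nested Python lists of one-symbol strings and sub-lists, and the function name vconcat is hypothetical.

```python
def vconcat(u, x, t):
    """Replace every occurrence of symbol x in tree u with the tree t.

    Trees are nested lists whose items are symbols (strings) or sub-lists.
    """
    result = []
    for item in u:
        if isinstance(item, list):
            result.append(vconcat(item, x, t))   # recurse into a nested label
        elif item == x:
            result.append(t)                     # the variable becomes the subtree t
        else:
            result.append(item)                  # any other symbol is kept as is
    return result


print(vconcat(["a", "x", "b"], "x", ["x"]))
```

This prints `['a', ['x'], 'b']`, which shows why there is no neutral element: even substituting the single-symbol tree for its own variable adds a level of nesting.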
6.3 Regular tree expressions

Combination of string regular expressions and classical tree regular expressions yields the following
definition.

Definition 8. The set RSTE(Σ) of regular string tree expressions over Σ is defined as follows:

• ⟨⟩ ∈ RSTE(Σ);

• Σ ⊆ RSTE(Σ);

• if r_1, r_2 ∈ RSTE(Σ), then r_1|r_2 ∈ RSTE(Σ) (union);

• if r_1, r_2 ∈ RSTE(Σ), then r_1 · r_2 ∈ RSTE(Σ) (horizontal concatenation);

• if r_1, r_2 ∈ RSTE(Σ) and x ∈ Σ, then r_1 ·_x r_2 ∈ RSTE(Σ) (vertical concatenation);

• if r ∈ RSTE(Σ), then r* ∈ RSTE(Σ) (horizontal iteration);

• if r ∈ RSTE(Σ) and x ∈ Σ, then r^{*x} ∈ RSTE(Σ) (vertical iteration).

Note that encapsulation, not mentioned above explicitly, is also a regular operator: ⟨r⟩ = ⟨x⟩ ·_x r
for x ∈ Σ.

The language described by an RSTE r is defined analogously to the classical theories and is
denoted as [r]. Regular string tree expressions have the same expressive power as regular string
tree grammars and finite automata.



6.3.1 Regular expression matching and substitution 

Regular expressions play a major role in many text processing tools, such as 'sed', 'awk', 'perl', 
typically found on Unix systems [3]. They allow matching and selecting pieces of text that can 
then be re-combined to produce desired results. First, the input text is matched against a regular 
expression. When a match is found, sub-expressions of that regular expression are associated with 
the corresponding fragments of text that they match. These fragments can then be extracted 
simply by referring to the desired sub-expressions. 

In the traditional tools, regular expressions are represented in a parenthesised infix form. 
Round brackets in expressions play a dual role: they group regular operators and also identify 
sub-expressions that will be used for text extraction. These sub-expressions are then referred to by
their numbers (counted left-to-right according to their opening brackets). For example, expression
'(press|push|hit|strike) space (key|bar)', applied to the phrase 'push space bar', would result in a 
successful match, selecting 'push' and 'bar' as fragments 1 and 2, respectively. 
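
The same example can be reproduced with Python's standard re module, used here purely as one representative of the traditional tools (the paper itself mentions sed, awk and perl, not Python):

```python
import re

pattern = r"(press|push|hit|strike) space (key|bar)"
match = re.fullmatch(pattern, "push space bar")

# Groups are numbered left to right by their opening brackets.
print(match.group(1), match.group(2))
```

This prints `push bar`, the two fragments selected by sub-expressions 1 and 2.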

Below are a few examples of string tree regular expressions. We assume that the variables used
in the expressions cannot occur in actual trees; in other words, variables (denoted below as X,
Y, Z) are unique symbols, added to the input alphabet Σ. In our notation, operators have the
following priorities:

*    ^{*x}    (highest priority)
·    ·_x
|             (lowest priority)

We also use the symbol σ as a convenient shorthand for the union of all single symbol trees without
variables.

The expression (σ|⟨X⟩)*^{*X} matches any tree. This expression will be denoted as θ, assuming
that the variable X is not used anywhere in the expression that contains θ, as it may cause
unwanted interference. In other words, the scope of X in θ is restricted to θ. An expression
that matches any tree with some variable(s) will also be handy, e.g.: θ_XY = (σ|⟨X⟩|⟨Y⟩|⟨Z⟩)*^{*Z},
where Z has a restricted scope.
Figure 5: Selecting a subtree with a regular expression

Figure 5 shows a regular expression that selects a subtree located tightly between two subtrees 
labelled left and right (in this order) somewhere below the root of the input tree. The requested 
subtree is selected by the second pair of round brackets. The first pair is needed for grouping 
operators and (as a side effect) selects the parent tree of the one we are looking for. 

What if the same regular sub-expression matches more than one fragment? This is largely an 
implementation issue and is up to a particular application. Let us consider potential solutions. 

To begin with, we need to separate two cases when an expression can match more than one
fragment: ambiguity and multiplicity. Ambiguity happens when a sub-expression can match one
out of several alternatives, for example, in an expression like r* · (r) · r*. This is the same as
r*, a concatenation of zero or more trees described by r. The (r) in the middle can match any
tree in the sequence: the first one, the last one, or a tree somewhere in between. Multiplicity
happens when the same sub-expression matches several fragments simultaneously, as in (r)*. This
describes the same sequence as the previous example, but this time (r) matches all the trees that
compose that sequence. By definition of iteration, there are in fact infinitely many copies of r in
that expression, each copy matching at most one tree; when these copies are represented in the
formula by a single sub-expression, it happens to match all those trees simultaneously.

Ambiguity is traditionally solved by imposing longest-match or shortest-match rules, or by 
forbidding ambiguous regular expressions. In case of multiplicity, usually only the first or the last 
matching fragment is selected. However, as opposed to the text case, a tree processing application 
has the benefit of hierarchical structure: several fragments can be combined into one object simply 
by adding another level of hierarchy. Thus, a regular sub-expression that can potentially exhibit 
multiplicity may be associated with a tree whose subtrees are the fragments matched. 
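
For comparison, the text case can be sketched with Python's standard re module, where a repeated 
group keeps only the last fragment it matched. Collecting every fragment into a list is the string 
analogue of adding another level of hierarchy. The helper name all_fragments and the sample data 
are assumptions made for this illustration, and the helper relies on the unit pattern being 
unambiguous, so that a left-to-right scan finds the same fragments as the iteration. 

```python
import re


def all_fragments(unit: str, text: str) -> list[str]:
    """Return every fragment matched by `unit` when `text` matches (unit)*.

    Assumes `unit` is unambiguous; returns [] if the whole text does not match.
    """
    if re.fullmatch(f"(?:{unit})*", text) is None:
        return []
    return [m.group(0) for m in re.finditer(unit, text)]


# Multiplicity in the text case: the group is matched three times,
# but only the last fragment is kept by the regular expression engine.
last_only = re.fullmatch(r"(ab|c)*", "abcab").group(1)

# Combining all fragments into one object (here a list of strings).
combined = all_fragments(r"ab|c", "abcab")
```

Here last_only holds a single string, whereas combined holds all three fragments in order. 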

7 Potential applications of the string tree automata theory 

The data model described in this paper enables processing of structured information in the same 
way as it is done today with non-structured textual data. In many text processing utilities and 
applications, data records are matched against one or more string regular expressions. Some parts 
of these expressions are marked. When a match occurs, each marked part is associated with the 
piece of text that matched that part within the regular expression. These pieces of text, extracted 
from the data record, are then used according to the application's needs. They can be combined 
together and with other pieces (e.g., string literals) using concatenation, or they can be processed 
further as simple strings. The same basic processing scenario applies to our hierarchical data 
model (as illustrated in Figure 6). 

Figure 6: Tree manipulation using regular string tree expressions 

An information processing application which uses the proposed data model needs to implement 
a set of operations on trees that can be used in formulae (the right box in Figure 6). These 
operations will likely include the basic tree operators (encapsulation, concatenation) and some 
traditional string functions (character translation, table lookup, etc.). Note that many of these 
functions can be expressed using regular expression matching, concatenation, and string literals; 
thus, they do not extend the model, but merely act as convenient shortcuts. Tree processing 
primitives (such as sorting of children) can also be included in the application's operational model. 

The application may also employ some execution model which specifies how (in what order, 
under what conditions) tree operators are invoked. For example, the execution model can provide 
variables for storing trees. In one envisaged scenario, a tree transformation utility may treat 
the transformation it is performing as a sequence of rules, each consisting of a condition and an 
action. The utility would execute rules sequentially by checking the condition and, if it holds, 
doing the corresponding action. Actions would normally consist of concatenating trees taken from 
variables and constant expressions, and storing them back into variables. 
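
A minimal sketch of such a rule-driven utility is given below. It is only an illustration: trees 
are encoded as nested Python lists, and the condition is an ordinary Python function standing in 
for a regular string tree expression. All names (run, between_left_and_right, wrap_fragment) and 
the variable names "input" and "output" are hypothetical. 

```python
from typing import Callable, Optional

Tree = list   # assumed encoding: items are symbols (str) or subtrees (list)
Vars = dict   # execution model: named variables that store trees

# A rule pairs a condition with an action. The condition returns the
# extracted fragments when it holds, or None when it does not.
Rule = tuple[Callable[[Vars], Optional[dict]], Callable[[Vars, dict], None]]


def run(rules: list[Rule], variables: Vars) -> Vars:
    """Execute rules sequentially: check the condition, then do the action."""
    for condition, action in rules:
        fragments = condition(variables)
        if fragments is not None:
            action(variables, fragments)
    return variables


def between_left_and_right(variables: Vars) -> Optional[dict]:
    """Stand-in for a tree pattern: select X in [..., 'left', X, 'right', ...].

    Only the top level of the input tree is searched in this sketch.
    """
    tree = variables["input"]
    for i in range(len(tree) - 2):
        if tree[i] == "left" and tree[i + 2] == "right":
            return {"X": tree[i + 1]}
    return None


def wrap_fragment(variables: Vars, fragments: dict) -> None:
    """Action: concatenate a constant tree with the extracted fragment."""
    variables["output"] = ["found"] + [fragments["X"]]


result = run(
    [(between_left_and_right, wrap_fragment)],
    {"input": ["a", "left", ["b", "c"], "right"]},
)
```

The condition plays both roles at once: it decides whether the rule fires and supplies the 
fragments that the action recombines. 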

As suggested by the practices of using string regular expressions in text manipulation tools, in 
such scenarios regular expressions on trees can play a dual role: 

• to serve as a rule condition by telling whether a particular tree matches a pattern or not; 

• to "extract" parts of a tree. 

Types of applications that can benefit from this data model include: 

• stand-alone processing tools (generic or specialised), such as HTML or XML processors or 
MARC (Machine Readable Catalogue [15]) converters; 

• programming languages that include hierarchical data types; 

• information retrieval; 

• query languages. 

8 Conclusions 

The "string tree" data model introduced in this paper provides a simple algebraic notion for the 
hierarchical information structures. The model is based on two classical works: theory of automata 
and formal languages; and tree automata theory. The classical notions are extended and combined 
together to provide a powerful solution for today's information processing needs. The resulting 
model combines the expressive power of its parents: strings, trees, finite automata, and regular 
languages defined by the classical theories can be expressed in the proposed model. It is shown 
that many of the properties of classical models apply here as well. Therefore, it is reasonable to 
expect that even those techniques that are used in classical theories but not considered explicitly 
in this paper, can be formulated and re-used in terms of the proposed model. 

The most important conclusion is that processing of tree-like hierarchical data can benefit from 
the power of regular expressions in the same way that simple text processing has benefited from 
them to date. It is shown that regular expression matching on trees can be done by finite tree 
automata, which fit into linear memory and space constraints. 

The formal description of the proposed data model supplies a basis for building custom, 
problem-oriented, as well as generic solutions for data retrieval and processing. These solutions 
can build upon additional operators introduced on trees. 

The model offered is simple, consisting of only three operators: concatenation, encapsulation, 
and regular expression matching. The regular algebra contains five operators. Many processing 
functions, including frequently used convenience operators (such as, for example, extraction of the 
n-th subtree), can be expressed using the basic set of operators without the need to extend the 
formal model. This puts the proposed model in a favourable position with respect to existing 
solutions, which are often over-complicated. Currently, different bits of tree manipulation 
functionality (such as execution model, string functions, tree operators, extraction of sub-records) 
are often bundled together and depend on each other. This makes existing models cumbersome and 
inflexible. Also, if, for instance, such a model is incorporated into a programming language to 
provide it with a "tree" data type, that language is likely to already offer string and numeric 
processing. This results in duplicate functionality. 

Finally, the model described in this paper provides a unified representation of both information 
trees and character strings, which makes it suitable as a single data type for processing of both 
structure and content of hierarchical documents. A simple scenario is presented showing how this 
model can be used in information processing applications. This scenario is patterned after the 
current usage of traditional regular expressions in unstructured text processing. 

Further research can be centred along the following lines: 

• Use of non-string data types (e.g., numeric and Boolean), both in the data being processed 
and in the operational model. The present data model is capable of processing, say, numeric 
data if the application's operational model supports it (i.e., it contains arithmetic and 
conversion operators). However, feeding numeric data back to the regular expression matching 
engine is not obvious. This may be needed, for example, to extract a subtree by its number, 
where this number is determined dynamically, much like indexing an array by a variable. 
Of some relevance here is research on XML that has been investigating how data typing can 
be incorporated into XML schemata [16]. 

• Presentation of regular expressions. Tree regular expressions written in infix form (as in 
Figure 5 on page 21) look more complex and may be more difficult to compose than string 
regular expressions. On the other hand, this notation is also compact and close to the 
conventional syntax. Another possible representation may be that of the regular tree grammar, 
which has the advantage of being similar to SGML/XML DTD format. 

A Proof of equivalence of deterministic and generalised tree 
automata 

This Appendix contains the proof of Theorem 1 on page 11, which states that any generalised 
non-deterministic finite string tree automaton (GNFSTA) with "pure" states has an equivalent 
deterministic automaton (DFSTA). The basic principle behind this proof is the same as the one used 
in classical automata theory to prove the equivalence of deterministic and non-deterministic 
automata, namely subset construction. For an arbitrary GNFSTA, we build an equivalent DFSTA 
whose state set is the set of all subsets of the original GNFSTA's state set. A brief sketch of this 
proof is presented in Section 5.4, where the theorem is introduced. 
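
For orientation, the classical string-automaton version of subset construction can be sketched as 
follows. This is the textbook construction, not the string tree construction proved in this 
Appendix, and it builds only the reachable subsets rather than all of them. The function names and 
the sample automaton (strings over {a, b} ending in "ab") are assumptions made for the example. 

```python
from collections import deque


def subset_construction(alphabet, delta, starts, finals):
    """Classical NFA -> DFA conversion.

    delta maps (state, symbol) to a set of successor states.
    DFA states are frozensets of NFA states.
    """
    start = frozenset(starts)
    dfa_delta, seen, queue = {}, {start}, deque([start])
    while queue:
        current = queue.popleft()
        for sym in alphabet:
            # The successor subset is the union of successors of its members.
            target = frozenset(
                q for s in current for q in delta.get((s, sym), ())
            )
            dfa_delta[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A subset is final if it contains at least one final NFA state.
    dfa_finals = {s for s in seen if s & set(finals)}
    return start, dfa_delta, dfa_finals


def dfa_accepts(dfa, word):
    """Run the deterministic automaton on a word."""
    state, dfa_delta, dfa_finals = dfa
    for sym in word:
        state = dfa_delta[(state, sym)]
    return state in dfa_finals


# Sample NFA: state 0 loops on both symbols and guesses where "ab" starts.
nfa_delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa = subset_construction("ab", nfa_delta, starts={0}, finals={2})
```

The final-state rule used here (a subset is final when it intersects the original final set) is 
the same rule that defines the final states of the DFSTA in Section A.2. 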

The requirement that the original GNFSTA be an automaton with "pure" states allows us to 
distinguish the original tree's input symbols from states generated during the automaton's run. As 
shown in Section 5.3, any GNFSTA can be converted to an automaton with pure states. 

In the next section, we introduce two auxiliary items: a map $\Gamma$ and a relation $\lhd$, which will 
be used in the proof. Then, the following section proves a lemma which incorporates most of the 
work. After that, the validity of the main result of Theorem 1 follows almost immediately from 
the lemma. 

A.1 Preliminaries 

Let $A$ be a GNFSTA with pure states over $\Sigma$: $A = (\Sigma, Q, Q_f, Q_0, \Delta, \gamma)$. As suggested above 
(and in the sketch of the proof in Section 5.4), we want to construct a suitable DFSTA 
$A' = (\Sigma', Q', Q'_f, q'_0, \Delta')$, where $Q' = 2^Q$, and prove their equivalence. First, however, a few 
auxiliary concepts need to be introduced. 

Let $\mathcal{T} = T(Q \cup \{/\})$, $\mathcal{T}' = T(2^Q \cup \{/\})$. These sets contain trees on which the move relations 
$\vdash_A$ and $\vdash_{A'}$ are defined. That is, all intermediate trees which compose runs of $A$ and $A'$ belong 
to $\mathcal{T}$ and $\mathcal{T}'$, respectively. 

Because $A$ has pure states, its "vertical move" function $\gamma$ produces the empty set on all symbols 
from $\Sigma$. Let us redefine $\gamma$ on $\Sigma$ and then extend it to $2^Q$ as follows: 

$$\gamma(a) = \{a\} \text{ for all } a \in \Sigma;$$
$$\gamma(\{q_1, \ldots, q_n\}) = \gamma(q_1) \cup \cdots \cup \gamma(q_n).$$

We now define a function $\Gamma : \mathcal{T}' \to \mathcal{T}'$, which applies $\gamma$ to all the input symbols in intermediate 
trees for $A'$. That is, if a label in the tree is as yet "untouched" by the automaton (does not 
contain /), $\gamma$ is applied to all symbols of the label. If a label is partly processed, $\gamma$ is only 
applied to the part on the right hand side of the slash (/). Remembering that symbols in $\mathcal{T}'$ are 
sets of symbols of $\mathcal{T}$, we define $\Gamma$ using string representations of trees from $\mathcal{T}'$ as follows. Let 
$a$ denote a single state of $\mathcal{T}'$: $a \in 2^Q$. Let $s$ denote an arbitrary sub-string of a tree from $\mathcal{T}'$ 
which does not start with /: $s \in (2^Q \cup \{\langle, \rangle, /\})^*$, $s \neq /x$. Then, let 

$$\Gamma(\varepsilon) = \varepsilon, \qquad \Gamma(\langle s) = \langle\,\Gamma(s), \qquad \Gamma(\rangle s) = \rangle\,\Gamma(s),$$
$$\Gamma(a s) = \gamma(a)\,\Gamma(s), \qquad \Gamma(a / s) = a / \Gamma(s).$$

Note that by this definition, $\Gamma(xy) = \Gamma(x)\Gamma(y)$, if $y$ does not start with a slash. Also, if a tree $t'$ 
is not just an arbitrary tree from $\mathcal{T}'$, but an intermediate stage of the automaton $A'$, each / in 
$t'$ must have a state from $Q'$ immediately on its left. This implies, in particular, that for any 
representation $t' = l' \langle s' \rangle r'$, it is true that $\Gamma(t') = \Gamma(l') \langle \Gamma(s') \rangle \Gamma(r')$. 
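
As an informal illustration, the map $\Gamma$ can be sketched in Python over a token-list encoding of 
trees. The encoding (the strings "<", ">" and "/" for the structural symbols, frozensets for the 
set-valued symbols of the DFSTA), the function names, and the sample vertical-move table are 
assumptions made for this sketch only. 

```python
def sample_gamma(state_set):
    """Hypothetical extended gamma: state "r" moves vertically to "p".

    Every other symbol is treated as an input symbol and maps to itself.
    """
    table = {"r": {"p"}}
    return frozenset().union(*(table.get(q, {q}) for q in state_set))


def big_gamma(tokens, gamma_ext):
    """Apply gamma_ext to every set token that is not directly left of '/'.

    Brackets and slashes are copied unchanged, as is a set token that is
    immediately followed by a slash (the already computed state of a label).
    """
    out = []
    for i, tok in enumerate(tokens):
        followed_by_slash = i + 1 < len(tokens) and tokens[i + 1] == "/"
        if isinstance(tok, frozenset) and not followed_by_slash:
            out.append(gamma_ext(tok))
        else:
            out.append(tok)
    return out


# The tree <{q}/{r}{x}>: state {q} is left of the slash and stays as it is.
tokens = ["<", frozenset({"q"}), "/", frozenset({"r"}), frozenset({"x"}), ">"]
image = big_gamma(tokens, sample_gamma)
```

In this example only the two symbols to the right of the slash are rewritten. 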

Another thing that will be needed for the proof is a relation between $\mathcal{T}$ and $\mathcal{T}'$, which is induced 
by the simple "belongs to" relation between a set and its element. Let $t \in \mathcal{T}$ and $t' \in \mathcal{T}'$. We say 
that $t$ is an instance of $t'$ (denoted $t \lhd t'$) if: 

(a) $t$ and $t'$ have the same structure, i.e. symbols $\langle$, $\rangle$, and / occupy the same positions in both 
trees; and 

(b) each $Q$-symbol in $t$ belongs to the set of symbols in the same position in $t'$. 
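Or, more formally, for any $a \in Q$, $a' \in 2^Q$, $s \in (Q \cup \{\langle, \rangle, /\})^*$, $s' \in (2^Q \cup \{\langle, \rangle, /\})^*$: 

$$\varepsilon \lhd \varepsilon$$
$$a s \lhd a' s' \iff a \in a' \text{ and } s \lhd s'$$
$$\langle s \lhd \langle s' \iff s \lhd s'$$
$$\rangle s \lhd \rangle s' \iff s \lhd s'$$
$$/ s \lhd / s' \iff s \lhd s'$$

(where $\iff$ denotes logical equivalence). An immediate corollary is that $xy \lhd x'y'$ is equivalent 
to $x \lhd x'$, $y \lhd y'$ for any strings $x, y, x', y'$ taken from the appropriate string sets. 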
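
The instance relation can be checked position by position, which the following Python sketch 
illustrates. It assumes a token-list encoding that is not part of the paper: structural symbols 
are the strings "<", ">" and "/", symbols of the original automaton are strings, and symbols of 
the DFSTA are frozensets of such strings. State names are assumed never to coincide with the 
structural symbols. 

```python
STRUCTURAL = {"<", ">", "/"}


def is_instance(t, t_prime):
    """Return True if token list t is an instance of token list t_prime."""
    if len(t) != len(t_prime):
        return False
    for tok, tok_p in zip(t, t_prime):
        if tok in STRUCTURAL or tok_p in STRUCTURAL:
            # Condition (a): brackets and slashes occupy the same positions.
            if tok != tok_p:
                return False
        elif tok not in tok_p:
            # Condition (b): each symbol belongs to the set in that position.
            return False
    return True


t_prime = ["<", frozenset({"q1", "q2"}), "/", frozenset({"a"}), ">"]
good = is_instance(["<", "q1", "/", "a", ">"], t_prime)
bad_symbol = is_instance(["<", "q3", "/", "a", ">"], t_prime)
bad_shape = is_instance(["<", "q1", "a", "/", ">"], t_prime)
```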

A.2 Equivalence of DFSTA and GNFSTA 

Given a generalised non-deterministic finite string tree automaton $A = (\Sigma, Q, Q_f, Q_0, \Delta, \gamma)$ with 
pure states, let us define the DFSTA $A' = (\Sigma', Q', Q'_f, q'_0, \Delta')$ as follows: 

$$\Sigma' = \{\{a\} \mid a \in \Sigma\}; \quad Q' = 2^Q; \quad q'_0 = Q_0; \quad Q'_f = \{q' \in Q' \mid q' \cap Q_f \neq \emptyset\};$$
$$\Delta'(a', b') = \{c \in Q \mid (a, b) \to c \in \Delta,\ a \in a',\ b \in \gamma(b')\} \text{ for all } a', b' \in Q',$$

where $\gamma$ is understood in the extended sense. As follows from the definition of $\Sigma'$, the tree monoids 
$T(\Sigma)$ and $T(\Sigma')$ are identical up to renaming of symbols: symbols from $\Sigma$ map to the corresponding 
single-element sets in $\Sigma'$. To prove the equivalence of $A$ and $A'$, all we need to do is to prove that 
for any $t \in T(\Sigma)$ and its counterpart $t' \in T(\Sigma')$, $t$ belongs to $L(A)$ if and only if $t'$ belongs to 
$L(A')$. 
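
The definition of $\Delta'$ and of the final states translates almost literally into code. The sketch 
below is an illustration under assumed encodings: the rules of $\Delta$ are given as a set of 
((a, b), c) tuples, $\gamma$ as a dictionary from a state to a set of states, and all function and 
variable names are hypothetical. 

```python
def extended_gamma(gamma, alphabet, states):
    """Extended gamma: {a} for input symbols, union of images over a set."""
    result = set()
    for q in states:
        result |= {q} if q in alphabet else gamma.get(q, set())
    return frozenset(result)


def delta_prime(delta, gamma, alphabet, a_set, b_set):
    """Delta'(a', b') = {c | (a, b) -> c in Delta, a in a', b in gamma(b')}."""
    b_images = extended_gamma(gamma, alphabet, b_set)
    return frozenset(
        c for (a, b), c in delta if a in a_set and b in b_images
    )


def is_final(q_set, finals):
    """A DFSTA state is final if it intersects the original final set."""
    return bool(set(q_set) & set(finals))


# Sample data: input symbol "x", vertical move r -> p, two horizontal rules.
alphabet = {"x"}
gamma = {"r": {"p"}}
delta = {(("q0", "x"), "q1"), (("q0", "p"), "q2")}

next_state = delta_prime(
    delta, gamma, alphabet, frozenset({"q0"}), frozenset({"x", "r"})
)
```

Because the result is always a (possibly empty) subset of $Q$, every pair of DFSTA states has a 
successor in this sketch. 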

To do this, we first consider the set of all intermediate trees reachable from $t$ in $n$ steps 
under $\vdash_A$, and, analogously, the set of all trees reachable from $t'$ via $\vdash_{A'}$. We shall prove that 
the first set contains exactly all instances of $\Gamma$-images of trees from the second set. 

Lemma 1. Let $\mathrm{Steps}^n_A(t)$ denote such sets: $\mathrm{Steps}^n_A(t) = \{v \in T(Q_A \cup \{/\}) \mid t \vdash^n_A v\}$. Then for 
any $v \in \mathrm{Steps}^n_A(t)$, $v$ is an instance of the $\Gamma$-image of some $v' \in \mathrm{Steps}^n_{A'}(t')$ (that is, $v \lhd \Gamma(v')$), 
and for any $v' \in \mathrm{Steps}^n_{A'}(t')$ all of the instances of $\Gamma(v')$ belong to $\mathrm{Steps}^n_A(t)$. 

Proof. Induction on $n$. 

Basis. $n = 0$. Then $\mathrm{Steps}^0_A(t) = \{t\}$, $\mathrm{Steps}^0_{A'}(t') = \{t'\}$, $\Gamma(t') = t'$ (because $t'$ only contains 
symbols from $\Sigma'$), and $t \lhd t'$ by the definition of $\lhd$. 

Induction. By the inductive hypothesis, the statement is assumed to be true for $\mathrm{Steps}^n$. We need 
to prove that: 

(i) for any $v \in \mathrm{Steps}^{n+1}_A(t)$ there exists $v' \in \mathrm{Steps}^{n+1}_{A'}(t')$ such that $v \lhd \Gamma(v')$; 

(ii) for any $v' \in \mathrm{Steps}^{n+1}_{A'}(t')$ and $v \in T(Q \cup \{/\})$, if $v \lhd \Gamma(v')$, then $v \in \mathrm{Steps}^{n+1}_A(t)$. 

Let us start by proving (i). Since $v \in \mathrm{Steps}^{n+1}_A(t)$, there exists $u \in \mathrm{Steps}^n_A(t)$ such that $u \vdash_A v$. 
Therefore, by the inductive hypothesis there exists $u' \in \mathrm{Steps}^n_{A'}(t')$ such that $u \lhd \Gamma(u')$. Consider 
all possible moves from $u$ to $v$. 

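(a) $u = l \langle s \rangle r$, $v = l \langle q_0 / s \rangle r$, where $q_0 \in Q_0$. 

Since $u \lhd \Gamma(u')$, we can write $u' = l' \langle s' \rangle r'$, where $l \lhd \Gamma(l')$, $s \lhd \Gamma(s')$, $r \lhd \Gamma(r')$, and $s'$ does 
not contain /. Consider now $v' = l' \langle Q_0 / s' \rangle r'$. By the initial state assignment move of $A'$, 
$u' \vdash_{A'} v'$ (remembering that $q'_0 = Q_0$). At the same time, $\Gamma(v') = \Gamma(l') \langle Q_0 / \Gamma(s') \rangle \Gamma(r')$; since 
$q_0 \in Q_0$, we get that $v \lhd \Gamma(v')$. 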

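(b) $u = l \langle a / b s \rangle r$, $v = l \langle c / s \rangle r$, where $(a, b) \to c \in \Delta$. 

Then, $u'$ can be represented as $u' = l' \langle a' / b' s' \rangle r'$, where $a \in a'$, $b \in \gamma(b')$, and $l, s, r$ are 
instances of $\Gamma(l')$, $\Gamma(s')$, $\Gamma(r')$ respectively. From the definition of $\Delta'$ it immediately follows that 
$\Delta'(a', b') = c' \ni c$. Let $v' = l' \langle c' / s' \rangle r'$; then $u' \vdash_{A'} v'$ and $v \lhd \Gamma(v')$. 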

(c) $u = l \langle q_1 / \rangle r$, $v = l q_2 r$, where $q_1 \to q_2 \in \gamma$ or, in other words, $q_2 \in \gamma(q_1)$. 

Then, $u' = l' \langle q' / \rangle r'$, where $q_1 \in q'$, $l \lhd \Gamma(l')$, $r \lhd \Gamma(r')$. Consider $v' = l' q' r'$; by the vertical 
move of $A'$, $u' \vdash_{A'} v'$. Now, $q_2 \in \gamma(q_1) \subseteq \gamma(q')$. Because $r$ cannot start with a /, neither can 
$r'$, therefore $\Gamma(v') = \Gamma(l')\,\gamma(q')\,\Gamma(r')$, and consequently, $v \lhd \Gamma(v')$. 

Let us now prove (ii). Since $v' \in \mathrm{Steps}^{n+1}_{A'}(t')$, there exists $u' \in \mathrm{Steps}^n_{A'}(t')$ such that $u' \vdash_{A'} v'$. 
Therefore, by the inductive hypothesis for any $u \in T(Q \cup \{/\})$, $u \lhd \Gamma(u')$ implies that $u \in \mathrm{Steps}^n_A(t)$. 
We need to show that $v \lhd \Gamma(v')$ implies $v \in \mathrm{Steps}^{n+1}_A(t)$. Let us take an arbitrary $v$ such that 
$v \lhd \Gamma(v')$, and consider all possible moves from $u'$ to $v'$. 

(a) $u' = l' \langle s' \rangle r'$, $v' = l' \langle q'_0 / s' \rangle r'$. 

Remembering that $q'_0 = Q_0$, and because $v \lhd \Gamma(v') = \Gamma(l') \langle Q_0 / \Gamma(s') \rangle \Gamma(r')$, we can write $v$ 
as $v = l \langle q_0 / s \rangle r$, where $l, s, r$ are instances of $\Gamma(l')$, $\Gamma(s')$, $\Gamma(r')$, and $q_0 \in Q_0$. Let $u = l \langle s \rangle r$. 
By the properties of $\Gamma$ and the $\lhd$-relation, $u \lhd \Gamma(u')$, therefore by the inductive hypothesis 
$u \in \mathrm{Steps}^n_A(t)$. On the other hand, $u \vdash_A v$ by the initial state assignment move of $A$. 
Consequently $v \in \mathrm{Steps}^{n+1}_A(t)$. 

(b) $u' = l' \langle a' / b' s' \rangle r'$, $v' = l' \langle c' / s' \rangle r'$, and $\Delta'(a', b') = c'$. 

We know that $v \lhd \Gamma(v') = \Gamma(l') \langle c' / \Gamma(s') \rangle \Gamma(r')$, which implies that $v = l \langle c / s \rangle r$ for some 
$l, c, s, r$ such that $l, s, r$ are instances of $\Gamma(l')$, $\Gamma(s')$, $\Gamma(r')$, and $c \in c'$. Since $c \in \Delta'(a', b')$ 
and by the definition of $\Delta'$ (which we used to build our $A'$), there exist $a \in a'$, $b \in \gamma(b')$, 
and a rule $(a, b) \to c \in \Delta$. Let $u = l \langle a / b s \rangle r$. Using the definition and properties of $\Gamma$, we get 
$\Gamma(u') = \Gamma(l') \langle a' / \gamma(b')\,\Gamma(s') \rangle \Gamma(r')$, so that $u \lhd \Gamma(u')$. Using the inductive hypothesis and the 
fact that $u \vdash_A v$ by the horizontal move of $A$, we can observe that $v \in \mathrm{Steps}^{n+1}_A(t)$. 

(c) $u' = l' \langle q' / \rangle r'$, $v' = l' q' r'$, and $q' \in Q'$. 

Since $r'$ cannot start with a /, $v \lhd \Gamma(v') = \Gamma(l')\,\gamma(q')\,\Gamma(r')$. Then, $v = l q_2 r$, where 
$l \lhd \Gamma(l')$, $r \lhd \Gamma(r')$, and $q_2 \in \gamma(q')$. Because $\gamma$ of a set is the union of $\gamma$-images of all elements 
of that set, there must be $q_1 \in q'$ such that $q_2 \in \gamma(q_1)$. Now let $u = l \langle q_1 / \rangle r$. Because 
$\Gamma(u') = \Gamma(l') \langle q' / \rangle \Gamma(r')$ and $q_1 \in q'$, we get $u \lhd \Gamma(u')$ and use the inductive hypothesis. Also, 
$q_2 \in \gamma(q_1)$ implies that $u \vdash_A v$ by the vertical move of $A$. Therefore, $v \in \mathrm{Steps}^{n+1}_A(t)$. 

□ 

Now, it only remains to show that a final state is either reachable or non-reachable 
simultaneously from both $t$ and $t'$. In other words, we want to prove that $\langle q_f / \rangle \in \mathrm{Steps}^n_A(t)$ if and 
only if $\langle q'_f / \rangle \in \mathrm{Steps}^n_{A'}(t')$, where $q_f \in Q_f$, $q'_f \in Q'_f$. 

Proof. $\Rightarrow$: Suppose $\langle q_f / \rangle \in \mathrm{Steps}^n_A(t)$ for some $q_f \in Q_f$ and $n \in \mathbb{N}$. By Lemma 1, there is 
$v' \in \mathrm{Steps}^n_{A'}(t')$ such that $\langle q_f / \rangle \lhd \Gamma(v')$. Therefore, we can say that $\Gamma(v') = \langle q' / \rangle$, where $q' \ni q_f$. 
By the definition of $\Gamma$, in this case $v' = \Gamma(v')$. Now, $q' \cap Q_f$ is non-empty (contains $q_f$), which means 
that $q' \in Q'_f$. 

$\Leftarrow$: Suppose $v' = \langle q'_f / \rangle \in \mathrm{Steps}^n_{A'}(t')$ for some $q'_f \in Q'_f$ and $n \in \mathbb{N}$. Again, $\Gamma(v') = v'$. Now, $q'_f$ 
being in $Q'_f$ implies that $q'_f \cap Q_f$ is non-empty; that is, there is $q$ such that $q \in q'_f$ and $q \in Q_f$. 
Let $v = \langle q / \rangle$, a tree from $T(Q \cup \{/\})$. Since $q \in q'_f$, we can see that $v \lhd v' = \Gamma(v')$. Then, by 
Lemma 1, $v \in \mathrm{Steps}^n_A(t)$; and, as we already know, $q \in Q_f$. □ 

Corollary 1. For any FSTA there exists an equivalent complete FSTA, i.e. such that any pair of 
states has a matching rule in $\Delta$. 

Proof. The automaton A' built in the above proof is complete. □ 